Joining your data
Join based on the metadata field _id
The metadata field _id
is supported as a join key field in semi-join queries.
Example
Having indexed the company and person indices and indexed documents, you can consider joining the two indices to fetch a list of companies whose employees are in the person
index by using the following request:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"], (1)
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}'
1 | The metadata field _id of the index company is used as the left join key field |
The response should contain two hits, as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_full_name": "Alice Wonder"
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Bueno"
}
}
]
}
Suppose that the two indices are joined in order to retrieve a list of companies using the following request:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["ceo.keyword", "_id"], (1)
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}'
1 | The metadata field _id of the index person is used as the right join key field |
The response should contain one hit, as follows:
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index": "company",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_full_name": "Alice Wonder"
}
}
]
}
}
Join based on a runtime field
Runtime fields can also be used as part of a join. They can be as part of the parent index or the child index, as shown in the following example:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["rt_field_company", "rt_field_person"], (1)
"request" : {
"runtime_mappings": {
"rt_field_person": {
"type": "long",
"script": {
"source": "<put your script here>" (2)
}
}
},
"query" : {
"match_all" : {}
}
}
}
},
"runtime_mappings": {
"rt_field_company": {
"type": "long",
"script": {
"source": "<put your script here>" (3)
}
}
}
}'
1 | Join on two runtime fields defined in the request |
2 | Runtime field for person index |
3 | Runtime field for company index |
In this example, the join operation is performed on two runtime fields: rt_field_company
and rt_field_person
. The runtime mappings are defined within the request for both the parent (company) and child (person) indices. Replace <put your script here>
with the appropriate Painless script to compute the desired values for each runtime field.
Join based on a composite key
Runtime fields can be used to create composite keys, that can be used to join on multiple conditions. For example, if you have first_name
and last_name
fields, you often want to join on both fields at the same time. You can use runtime fields to concatenate multiple fields, and use this composite key as a join key.
Another use case example is if you have fields for IP address and network port, you can create a composite key <ip_address>:<port>
.
Like in the previous section, they can be defined for the parent index, the child index, or both. An example involving the company and person indices could look like this:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["ceo_full_name.keyword", "rt_full_name"], (1)
"request" : {
"runtime_mappings": {
"rt_full_name": { (2)
"type": "keyword",
"script": {
"source": "emit(doc['first_name.keyword'].value + ' ' + doc['last_name.keyword'].value)"
}
}
},
"query" : {
"match_all" : {}
}
}
}
}
}'
1 | Join between an indexed field and a runtime field |
2 | Runtime field for the person index, containing the concatenation of multiple fields |
In this example, a join operation is performed between the ceo_full_name
field of the company
index and a runtime field rt_full_name
. The runtime field is a composite key: a concatenation of the first_name
and last_name
of employees in the person
index.
This allows to join indices even when they don’t have directly corresponding keys.
Retrieving a projected field
You can use the fields
parameter to retrieve the values of a projected field for each hit.
For example, the two indices can be joined to retrieve a list of companies along with the ages of all their employees using the following request:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"project" : [
{ "field" : { "name" : "age", "alias" : "employees_age" } } (1)
],
"query" : {
"match_all" : {}
}
}
}
},
"fields" : ["employees_age"] (2)
}
'
1 | Projects the field age from index person as employees_age . |
2 | Returns the projected field employees_age using the fields parameter. |
The response contains two hits, one for each company, with the field employees_age
as follows:
{
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_fullname": "Alice Wonder"
},
"fields": {
"employees_age": [
25,
50
]
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Bueno"
},
"fields": {
"employees_age": [
20
]
}
}
]
}
}
Retrieving a runtime field
The runtime field mapping defined in the search request allows creating a field that exists only as part of the query.
This new field can be projected like any other field by using the fields
parameter.
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"runtime_mappings": {
"rt_field": {
"type": "long",
"script": {
"source": "<put your script here>" (1)
}
}
},
"project" : [
{ "field" : { "name" : "rt_field" } } (2)
],
"query" : {
"match_all" : {}
}
}
}
},
"fields" : ["rt_field"] (3)
}'
1 | The runtime field defined in the request |
2 | To project the new field |
3 | To fetch the new field into the response if needed |
In this example, a runtime field named rt_field
is created with a Painless script. Replace <put your script here>
with the appropriate script to compute the desired value for the runtime field. The new runtime field is projected and retrieved into the response using the fields
parameter.
Projecting and retrieving fields of a nested data type
Runtime fields can be used to project fields of a nested
data type. For instance, consider the following indices
$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/teachers' -d '
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"classes": {"type": "keyword"}
}
}
}
'
$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/classes' -d '
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"students": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"surname": {
"type": "keyword"
}
}
}
}
}
}
'
As an example, we want to retrieve the full names of the teachers' students. These can be projected using runtime fields, and retrieved using the fields
parameter as follows
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/teachers/_search?pretty' -d '{
"query": {
"join": {
"indices": [
"classes"
],
"on": [
"classes",
"id"
],
"request": {
"runtime_mappings": {
"fullname": {
"type": "keyword",
"script": {
"source": "for (s in params._source.students) { emit(s.name + ' ' + s.surname) }" (1)
}
}
},
"project": {
"field": {
"name": "fullname" (2)
}
},
"query": {
"match_all": {}
}
}
}
},
"fields": ["fullname"] (3)
}'
1 | The runtime field defined in the request |
2 | To project the new field |
3 | To fetch the new field into the response if needed |
Filter using a projected field
Runtime fields can also be used to filter on projected data.
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"runtime_mappings": {
"employee_count": {
"type": "long",
"script": "emit(doc.employees.length)" (1)
}
},
"query": {
"bool": {
"must": [
{
"join": {
"indices": ["person"],
"on": ["_id", "employed_by.keyword"],
"request": {
"project": [
{
"field": {"name": "id", "alias": "employees"} (2)
}
],
"query": {
"match_all": {}
}
}
}
}
],
"filter": [
{
"range" : {
"employee_count" : { "gt" : 1 } (3)
}
}
]
}
}
}'
1 | The runtime field used to collect the projected fields |
2 | To project the id field for all matching documents in the person index |
3 | To filter the documents in the company index, using the value from the runtime field |
In this example (using the company and person indices), the runtime field employee_count
collects all the projected values from the person
index, and returns the number of values.
This runtime field is then used in the filter
clause to return only the companies that have more than 1 employee.
Sorting based on a projected field
A script-based sorting method can be used to sort the hits based on the values of a projected field, for example (using the company and person indices):
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"project" : [
{ "field" : { "name" : "age", "alias" : "employee_age" } }
],
"query" : {
"match_all" : {}
}
}
}
},
"runtime_mappings": {
"employee_ages": {
"type": "long",
"script": "int sum = 0; for (value in doc.employee_age) { sum += value } emit(sum);"
}
},
"sort": [
{
"employee_ages": "desc"
}
]
}'
The response should contain two hits, one for each company, sorted by the sum of their employees age as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": null,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_fullname": "Alice Wonder"
},
"sort": [
75
]
},
{
"_index": "company",
"_id": "2",
"_score": null,
"_source": {
"id": 2,
"name": "Bueno"
},
"sort": [
20
]
}
]
}
In this example, the join operation is performed between the company
and person
indices, and the projected field employee_age
is created. The script-based sorting method is used to sort the hits based on the sum of employees' ages in the employee_ages
runtime field. The search results are sorted in descending order, with the company having the highest sum of employees' ages appearing first in the search results.
Spatial join using range fields
Federate enables spatial joins through the application of one or more spatial conditions on range fields. For example, documents with a date
field in the parent index can be joined with the child index when the date
is within the retention_dates
field of the child index; where retention_dates
is of type date_range
.
This join can be expressed as follows:
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [ (1)
{"within": {"fields": ["date", "retention_dates"] }} (2)
],
}
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
1 | The join using range field data types. |
2 | Match documents where the date is within the ranges defined by retention_dates . |
In this example, a spatial join is performed between the parent_index
and child_index
based on the date
and retention_dates
fields. The join condition requires the parent index’s date
field to be within the range defined by the retention_dates
field.
If the range data type field is located on the parent_index
, use the contains
operator instead.
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [ (1)
{"contains": {"fields": ["retention_dates", "date"] }} (2)
],
}
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
1 | The join using range field data types. |
2 | Match documents where the range in retention_dates contains the date value. |
For information about accepted data types in a spatial join, see Requirements for a join and the known limitations section for spatial joins. |
Spatial join using simple fields
In addition to range fields, Federate enables spatial joins through the application of one or more spatial conditions on fields with basic data types such as numeric or date. For example, documents with a date
field in the parent index can be joined with the child index where the date is within a start_date
field and a end_date
field of the child index. This translates to the Boolean expression date > start_date && date < end_date
.
This spatial join can be expressed as follows:
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [ (1)
{"gt": {"fields": ["date", "start_date"] }}, (2)
{"lt": {"fields": ["date", "end_date"] }} (3)
]
},
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
1 | The join is the conjunction of 2 conditions. |
2 | Match documents where date is greater than start_date . |
3 | In addition, this matches documents where date is less than end_date . |
In this example, a spatial join is performed between the parent_index
and child_index
based on the date
, start_date
, and end_date
fields. The join condition requires the parent index’s date
field to be greater than the child index’s start_date
field and less than the child index’s end_date
field.
Instead of combining GT(E) and LT(E) conditions to join on ranges, you can use range data types and their specialized operators: |
Handling missing fields
Spatial join with range fields
natively support open-ended ranges. However, in scenarios where using range fields is not feasible, you can use the allow_missing_bound
parameter to interpret a missing value for a field (if the field is not set, or set to null
) as an open bound for the range.
The join operator in the previous example can be altered to also match documents missing a value for the end_date
field:
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [
{
"gt": {
"fields": ["date", "start_date"] (1)
}
},
{
"lt": {
"fields": ["date", "end_date"], (2)
"allow_missing_bound": "right" (3)
}
}
]
},
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
1 | Match documents where date is greater than start_date . |
2 | Also match documents where date is less than end_date . |
3 | Also match documents where the right field (end_date ) is missing a value (null or missing). |
Compatible operators and accepted syntax are detailed in the
definition of allow_missing_bounds
.
Note that when performing spatial join operations on simple fields using the |
Scoring Capabilities
By default, the join
filter returns a constant score. Therefore, the scores of the matching documents from the child set
do not affect the scores of the matching documents from the parent set. However, you can
project the document’s score from the child set
and customize the scoring of the documents from the parent set with a
script score function.
Document Score
The score of a matching document from a set may be projected using a standard field object using
the special field name _score
.
{
"field" : {
"name" : "_score",
"alias" : "employee_score"
}
}
Scoring based on a projected field
A script-based scoring method can be used to customize the scoring based on the values of a projected field. For example, you can project the score of the matching documents from the child set and aggregate them into the parent document as follows:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query": {
"function_score": {
"query": {
"join": {
"indices": [ "person" ],
"on": [ "_id", "employed_by.keyword" ],
"request": {
"project" : [
{ "field" : { "name" : "_score", "alias" : "child_score" } }
],
"query": {
"match_all": {}
}
}
}
},
"functions": [
{
"script_score": {
"script": {
"lang": "painless",
"source": "float sum = 0; for (value in doc.child_score) { sum += value } return sum;"
}
}
}
],
"score_mode": "multiply",
"boost_mode": "replace"
}
}
}'
The response should contain two hits, one for each company, sorted by the sum of their child scores as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 2,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_fullname": "Alice Wonder"
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Bueno"
}
}
]
}
Aggregating based on a projected field
A projected field can be used as part of the aggregation. For example, when using the company and person indices, you can project the values of
the field age
of the matching documents from the person
index and aggregate the documents from the company
index by using these values as follows:
$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"project" : [
{ "field" : { "name" : "age", "alias" : "employee_ages" } }
],
"query" : {
"match_all" : {}
}
}
}
},
"aggs": {
"by_company": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"average_employee_age": {
"avg": {
"field": "employee_ages"
}
}
}
}
}
}'
The response contains an aggregation result with a bucket for each company, and a sub-aggregation average_employee_age
corresponding to the average age of employees in this company.
{
"aggregations": {
"by_company": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Acme",
"doc_count": 1,
"average_employee_age": {
"value": 37.5
}
},
{
"key": "Bueno",
"doc_count": 1,
"average_employee_age": {
"value": 20
}
}
]
}
}
}
In this example, the query joins the company
index with the person
index using the id
and employed_by
fields. The query projects the age
field from the person
index using an alias employee_age
. The aggregation is performed using a runtime field employee_ages
, which iterates through the projected employee_age
field. The documents are first aggregated by company name, then a sub-aggregation average_employee_age
computes the average age of employees in each bucket.
Data example for company and person indices
This section will feature examples based on two indices, company
and person
.
The documents are indexed using the dynamic field mapping feature provided by Elasticsearch https://www.elastic.co/guide/en/elasticsearch/reference/8.12/dynamic-field-mapping.html.
The examples involving these indices are based on the following documents:
$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "company", "_id" : "1" } }
{ "id": 1, "name" : "Acme", "ceo": "peo1", "ceo_full_name": "Alice Wonder" }
{ "index" : { "_index" : "company", "_id" : "2" } }
{ "id": 2, "name" : "Bueno"}
{ "index" : { "_index" : "company", "_id" : "3" } }
{ "id": 3, "name" : "Ark" }
{ "index" : { "_index" : "person", "_id" : "peo1" } }
{ "id" : 1, "first_name" : "Alice", "last_name":"Wonder", "employed_by" : "1", "age": 50 }
{ "index" : { "_index" : "person", "_id" : "peo2" } }
{ "id" : 2, "first_name" : "Bob", "last_name":"Dylan", "employed_by" : "2", "age": 20 }
{ "index" : { "_index" : "person", "_id" : "peo3" } }
{ "id" : 3, "first_name" : "Carol", "last_name":"Aird", "employed_by" :"1", "age": 25 }
'