Joining your data

Join based on the metadata field _id

The metadata field _id is supported as a join key field in semi-join queries.

Example

Having indexed the company and person indices and indexed documents, you can consider joining the two indices to fetch a list of companies whose employees are in the person index by using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["person"],
        "on" : ["_id", "employed_by.keyword"],  (1)
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
      }
    }
}'
1 The metadata field _id of the index company is used as the left join key field

The response should contain two hits, as follows:

{
    "hits": [
      {
        "_index": "company",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "Acme",
          "ceo": "peo1",
          "ceo_full_name": "Alice Wonder"
        }
      },
      {
        "_index": "company",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "name": "Bueno"
        }
      }
    ]

}

Suppose that the two indices are joined in order to retrieve a list of companies using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["person"],
        "on" : ["ceo.keyword", "_id"],  (1)
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
      }
    }
}'
1 The metadata field _id of the index person is used as the right join key field

The response should contain one hit, as follows:

{
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index": "company",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "Acme",
          "ceo": "peo1",
          "ceo_full_name": "Alice Wonder"
        }
      }
    ]
  }
}

Join based on a runtime field

Runtime fields can also be used as part of a join. They can be as part of the parent index or the child index, as shown in the following example:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["person"],
        "on" : ["rt_field_company", "rt_field_person"],         (1)
        "request" : {
            "runtime_mappings": {
                "rt_field_person": {
                  "type": "long",
                  "script": {
                    "source": "<put your script here>"          (2)
                  }
                }
              },
              "query" : {
                "match_all" : {}
              }
        }
      }
    },
    "runtime_mappings": {
        "rt_field_company": {
          "type": "long",
          "script": {
            "source": "<put your script here>"                   (3)
          }
        }
      }
}'
1 Join on two runtime fields defined in the request
2 Runtime field for person index
3 Runtime field for company index

In this example, the join operation is performed on two runtime fields: rt_field_company and rt_field_person. The runtime mappings are defined within the request for both the parent (company) and child (person) indices. Replace <put your script here> with the appropriate Painless script to compute the desired values for each runtime field.

Join based on a composite key

Runtime fields can be used to create composite keys, that can be used to join on multiple conditions. For example, if you have first_name and last_name fields, you often want to join on both fields at the same time. You can use runtime fields to concatenate multiple fields, and use this composite key as a join key. Another use case example is if you have fields for IP address and network port, you can create a composite key <ip_address>:<port>.

Like in the previous section, they can be defined for the parent index, the child index, or both. An example involving the company and person indices could look like this:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
  "query" : {
    "join" : {
      "indices" : ["person"],
      "on" : ["ceo_full_name.keyword", "rt_full_name"],     (1)
      "request" : {
        "runtime_mappings": {
          "rt_full_name": {                                 (2)
            "type": "keyword",
            "script": {
              "source": "emit(doc['first_name.keyword'].value + ' ' + doc['last_name.keyword'].value)"
            }
          }
        },
        "query" : {
          "match_all" : {}
        }
      }
    }
  }
}'
1 Join between an indexed field and a runtime field
2 Runtime field for the person index, containing the concatenation of multiple fields

In this example, a join operation is performed between the ceo_full_name field of the company index and a runtime field rt_full_name. The runtime field is a composite key: a concatenation of the first_name and last_name of employees in the person index. This allows to join indices even when they don’t have directly corresponding keys.

Retrieving a projected field

You can use the fields parameter to retrieve the values of a projected field for each hit.

For example, the two indices can be joined to retrieve a list of companies along with the ages of all their employees using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
   "query" : {
      "join" : {
        "indices" : ["person"],
        "on" : ["_id", "employed_by.keyword"],
        "request" : {
          "project" : [
            { "field" : { "name" : "age", "alias" : "employees_age" } }  (1)
          ],
          "query" : {
            "match_all" : {}
          }
        }
      }
    },
    "fields" : ["employees_age"]                                         (2)
}
'
1 Projects the field age from index person as employees_age.
2 Returns the projected field employees_age using the fields parameter.

The response contains two hits, one for each company, with the field employees_age as follows:

{
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "company",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "Acme",
          "ceo": "peo1",
          "ceo_fullname": "Alice Wonder"
        },
        "fields": {
          "employees_age": [
            25,
            50
          ]
        }
      },
      {
        "_index": "company",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "name": "Bueno"
        },
        "fields": {
          "employees_age": [
            20
          ]
        }
      }
    ]
  }
}

Retrieving a runtime field

The runtime field mapping defined in the search request allows creating a field that exists only as part of the query. This new field can be projected like any other field by using the fields parameter.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
   "query" : {
      "join" : {
        "indices" : ["person"],
        "on" : ["_id", "employed_by.keyword"],
        "request" : {
            "runtime_mappings": {
                "rt_field": {
                  "type": "long",
                  "script": {
                     "source": "<put your script here>"                   (1)
                  }
                }
              },
              "project" : [
                { "field" : { "name" : "rt_field" } }                     (2)
              ],
              "query" : {
                "match_all" : {}
              }
        }
      }
    },
    "fields" : ["rt_field"]                                               (3)
}'
1 The runtime field defined in the request
2 To project the new field
3 To fetch the new field into the response if needed

In this example, a runtime field named rt_field is created with a Painless script. Replace <put your script here> with the appropriate script to compute the desired value for the runtime field. The new runtime field is projected and retrieved into the response using the fields parameter.

Projecting and retrieving fields of a nested data type

Runtime fields can be used to project fields of a nested data type. For instance, consider the following indices

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/teachers' -d '
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "classes": {"type": "keyword"}
    }
  }
}
'

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/classes' -d '
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "students": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "keyword"
          },
          "surname": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
'

As an example, we want to retrieve the full names of the teachers' students. These can be projected using runtime fields, and retrieved using the fields parameter as follows

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/teachers/_search?pretty' -d '{
  "query": {
    "join": {
      "indices": [
        "classes"
      ],
      "on": [
        "classes",
        "id"
      ],
      "request": {
        "runtime_mappings": {
          "fullname": {
            "type": "keyword",
            "script": {
              "source": "for (s in params._source.students) { emit(s.name + ' ' + s.surname) }" (1)
            }
          }
        },
        "project": {
          "field": {
            "name": "fullname"                                                                  (2)
          }
        },
        "query": {
          "match_all": {}
        }
      }
    }
  },
  "fields": ["fullname"]                                                                        (3)
}'
1 The runtime field defined in the request
2 To project the new field
3 To fetch the new field into the response if needed

Filter using a projected field

Runtime fields can also be used to filter on projected data.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
  "runtime_mappings": {
    "employee_count": {
      "type": "long",
      "script": "emit(doc.employees.length)"                          (1)
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "join": {
            "indices": ["person"],
            "on": ["_id", "employed_by.keyword"],
            "request": {
              "project": [
                {
                  "field": {"name": "id", "alias": "employees"}       (2)
                }
              ],
              "query": {
                "match_all": {}
              }
            }
          }
        }
      ],
      "filter": [
        {
          "range" : {
            "employee_count" : { "gt" : 1 }                           (3)
          }
        }
      ]
    }
  }
}'
1 The runtime field used to collect the projected fields
2 To project the id field for all matching documents in the person index
3 To filter the documents in the company index, using the value from the runtime field

In this example (using the company and person indices), the runtime field employee_count collects all the projected values from the person index, and returns the number of values. This runtime field is then used in the filter clause to return only the companies that have more than 1 employee.

Sorting based on a projected field

A script-based sorting method can be used to sort the hits based on the values of a projected field, for example (using the company and person indices):

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
     "join" : {
       "indices" : ["person"],
       "on" : ["_id", "employed_by.keyword"],
       "request" : {
          "project" : [
            { "field" : { "name" : "age", "alias" : "employee_age" } }
          ],
         "query" : {
           "match_all" : {}
         }
       }
     }
   },
   "runtime_mappings": {
    "employee_ages": {
      "type": "long",
      "script": "int sum = 0; for (value in doc.employee_age) { sum += value } emit(sum);"
     }
   },
   "sort": [
    {
      "employee_ages": "desc"
    }
  ]
}'

The response should contain two hits, one for each company, sorted by the sum of their employees age as follows:

 {
    "hits": [
      {
        "_index": "company",
        "_id": "1",
        "_score": null,
        "_source": {
          "id": 1,
          "name": "Acme",
          "ceo": "peo1",
          "ceo_fullname": "Alice Wonder"
        },
        "sort": [
          75
        ]
      },
      {
        "_index": "company",
        "_id": "2",
        "_score": null,
        "_source": {
          "id": 2,
          "name": "Bueno"
        },
        "sort": [
          20
        ]
      }
    ]
}

In this example, the join operation is performed between the company and person indices, and the projected field employee_age is created. The script-based sorting method is used to sort the hits based on the sum of employees' ages in the employee_ages runtime field. The search results are sorted in descending order, with the company having the highest sum of employees' ages appearing first in the search results.

Spatial join using range fields

Federate enables spatial joins through the application of one or more spatial conditions on range fields. For example, documents with a date field in the parent index can be joined with the child index when the date is within the retention_dates field of the child index; where retention_dates is of type date_range.

This join can be expressed as follows:

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [ (1)
            {"within": {"fields": ["date", "retention_dates"] }} (2)
          ],
        }
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'
1 The join using range field data types.
2 Match documents where the date is within the ranges defined by retention_dates.

In this example, a spatial join is performed between the parent_index and child_index based on the date and retention_dates fields. The join condition requires the parent index’s date field to be within the range defined by the retention_dates field.

If the range data type field is located on the parent_index, use the contains operator instead.

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [ (1)
            {"contains": {"fields": ["retention_dates", "date"] }} (2)
          ],
        }
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'
1 The join using range field data types.
2 Match documents where the range in retention_dates contains the date value.

For information about accepted data types in a spatial join, see Requirements for a join and the known limitations section for spatial joins.

Spatial join using simple fields

In addition to range fields, Federate enables spatial joins through the application of one or more spatial conditions on fields with basic data types such as numeric or date. For example, documents with a date field in the parent index can be joined with the child index where the date is within a start_date field and a end_date field of the child index. This translates to the Boolean expression date > start_date && date < end_date.

This spatial join can be expressed as follows:

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [ (1)
            {"gt": {"fields": ["date", "start_date"] }}, (2)
            {"lt": {"fields": ["date", "end_date"] }} (3)
          ]
        },
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'
1 The join is the conjunction of 2 conditions.
2 Match documents where date is greater than start_date.
3 In addition, this matches documents where date is less than end_date.

In this example, a spatial join is performed between the parent_index and child_index based on the date, start_date, and end_date fields. The join condition requires the parent index’s date field to be greater than the child index’s start_date field and less than the child index’s end_date field.

Instead of combining GT(E) and LT(E) conditions to join on ranges, you can use range data types and their specialized operators: contains, within, and intersect. Range data types natively support open-ended ranges, which can simplify your model. A range is considered open-ended in a range data type if it misses a value for one of its bounds – either the lower bound (gt or gte) or the upper bound (lt or lte).

Handling missing fields

Spatial join with range fields natively support open-ended ranges. However, in scenarios where using range fields is not feasible, you can use the allow_missing_bound parameter to interpret a missing value for a field (if the field is not set, or set to null) as an open bound for the range.

The join operator in the previous example can be altered to also match documents missing a value for the end_date field:

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [
            {
              "gt": {
                "fields": ["date", "start_date"] (1)
              }
            },
            {
              "lt": {
                "fields": ["date", "end_date"], (2)
                "allow_missing_bound": "right" (3)
              }
            }
          ]
        },
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'
1 Match documents where date is greater than start_date.
2 Also match documents where date is less than end_date.
3 Also match documents where the right field (end_date) is missing a value (null or missing).

Compatible operators and accepted syntax are detailed in the definition of allow_missing_bounds.

Note that when performing spatial join operations on simple fields using the allow_missing_bound parameter, there might be a reduction in efficiency compared to operations using range fields. This is because Siren Federate requires additional computational steps to handle documents that are missing a value. It is important to consider this trade-off between flexibility and performance.

Scoring Capabilities

By default, the join filter returns a constant score. Therefore, the scores of the matching documents from the child set do not affect the scores of the matching documents from the parent set. However, you can project the document’s score from the child set and customize the scoring of the documents from the parent set with a script score function.

Document Score

The score of a matching document from a set may be projected using a standard field object using the special field name _score.

{
  "field" : {
    "name" : "_score",
    "alias" : "employee_score"
  }
}

Scoring based on a projected field

A script-based scoring method can be used to customize the scoring based on the values of a projected field. For example, you can project the score of the matching documents from the child set and aggregate them into the parent document as follows:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "join": {
          "indices": [ "person" ],
          "on": [ "_id", "employed_by.keyword" ],
          "request": {
            "project" : [
              { "field" : { "name" : "_score", "alias" : "child_score" } }
            ],
            "query": {
              "match_all": {}
            }
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "float sum = 0; for (value in doc.child_score) { sum += value } return sum;"
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}'

The response should contain two hits, one for each company, sorted by the sum of their child scores as follows:

{
  "hits": [
      {
        "_index": "company",
        "_id": "1",
        "_score": 2,
        "_source": {
          "id": 1,
          "name": "Acme",
          "ceo": "peo1",
          "ceo_fullname": "Alice Wonder"
        }
      },
      {
        "_index": "company",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "name": "Bueno"
        }
      }
    ]
}

Aggregating based on a projected field

A projected field can be used as part of the aggregation. For example, when using the company and person indices, you can project the values of the field age of the matching documents from the person index and aggregate the documents from the company index by using these values as follows:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
 "query" : {
    "join" : {
      "indices" : ["person"],
      "on" : ["_id", "employed_by.keyword"],
      "request" : {
        "project" : [
          { "field" : { "name" : "age", "alias" : "employee_ages" } }
        ],
        "query" : {
          "match_all" : {}
        }
      }
    }
  },
  "aggs": {
    "by_company": {
      "terms": {
        "field": "name.keyword"
      },
      "aggs": {
        "average_employee_age": {
          "avg": {
            "field": "employee_ages"
          }
        }
      }
    }
  }
}'

The response contains an aggregation result with a bucket for each company, and a sub-aggregation average_employee_age corresponding to the average age of employees in this company.

{
  "aggregations": {
    "by_company": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Acme",
          "doc_count": 1,
          "average_employee_age": {
            "value": 37.5
          }
        },
        {
          "key": "Bueno",
          "doc_count": 1,
          "average_employee_age": {
            "value": 20
          }
        }
      ]
    }
  }
}

In this example, the query joins the company index with the person index using the id and employed_by fields. The query projects the age field from the person index using an alias employee_age. The aggregation is performed using a runtime field employee_ages, which iterates through the projected employee_age field. The documents are first aggregated by company name, then a sub-aggregation average_employee_age computes the average age of employees in each bucket.

Data example for company and person indices

This section will feature examples based on two indices, company and person.

The documents are indexed using the dynamic field mapping feature provided by Elasticsearch https://www.elastic.co/guide/en/elasticsearch/reference/8.12/dynamic-field-mapping.html.

The examples involving these indices are based on the following documents:

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "company", "_id" : "1" } }
{ "id": 1, "name" : "Acme", "ceo": "peo1", "ceo_full_name": "Alice Wonder" }
{ "index" : { "_index" : "company", "_id" : "2" } }
{ "id": 2, "name" : "Bueno"}
{ "index" : { "_index" : "company", "_id" : "3" } }
{ "id": 3, "name" : "Ark" }
{ "index" : { "_index" : "person",  "_id" : "peo1" } }
{ "id" : 1, "first_name" : "Alice", "last_name":"Wonder", "employed_by" : "1", "age": 50 }
{ "index" : { "_index" : "person", "_id" : "peo2" } }
{ "id" : 2, "first_name" : "Bob", "last_name":"Dylan", "employed_by" : "2", "age": 20 }
{ "index" : { "_index" : "person", "_id" : "peo3" } }
{ "id" : 3, "first_name" : "Carol", "last_name":"Aird", "employed_by" :"1", "age": 25 }
'