Configuring NLP for Siren Investigate

Using the siren-nlp ingest processor in an Elasticsearch pipeline

The Siren NLP plugin provides an Elasticsearch ingest processor named siren-nlp.

You can create a pipeline that contains the siren-nlp ingest processor, index a document by using the pipeline, and view the enriched document by using the Dev Tools page in Siren Investigate.

Data can also be enriched by using the siren-nlp processor during Excel/CSV import or when using data reflection in the Elasticsearch pipeline definition step.

The resulting document contains a new field called siren.nlp, which holds the annotations that are added by Siren NLP.

Complete the following steps:

  1. Create a pipeline in Elasticsearch to define the NLP processing.

    PUT _ingest/pipeline/nlp-pipeline
    {
      "processors" : [
        {
          "siren-nlp" : {
            "fields": ["title", "snippet"]
          }
        }
      ]
    }

  2. Index a document by using the nlp-pipeline.

    PUT testnlp/_doc/1?pipeline=nlp-pipeline
    {
      "title": "Bill Gates",
      "snippet": "Bill Gates is best known as the founder of the multi-national technology company Microsoft"
    }

  3. View the enriched document.

    GET testnlp/_source/1
    
    Response:
    {
      "snippet" : "Bill Gates is best known as the founder of the multi-national technology company Microsoft",
      "siren" : {
        "nlp" : {
          "instances" : {
            "snippet" : {
              "entity/person" : [
                ...

Configuring the siren-nlp ingest processor

The siren-nlp ingest processor definition has only one compulsory field, fields, which defines the source fields that you want to process. The optional fields and their default values are described in the following table.

Name | Required | Default | Description
--- | --- | --- | ---
fields | yes | - | The list of fields that will be processed.
target_field | no | siren.nlp | The new field that will be created to store the annotations.
start_offset_field | no | start | The name of the field that will hold the start offset for each annotation.
end_offset_field | no | end | The name of the field that will hold the end offset for each annotation.
processors | no | the default processors | A list of NLP processors and their configurations. For more information, see Siren NLP processors.
include_matches | no | true | If set to false, the matches field will not be included in the target_field.
include_ids | no | true | If set to false, the ids field will not be included in the target_field.
include_taxonomy_annotated | no | true | If set to false, the taxonomy_annotated field will not be included in the target_field.

For example, you can configure the ingest processor as follows:

{
  "processors" : [
    {
      "siren-nlp" : {
        "fields": ["title", "snippet"],
        "start_offset_field": "start",
        "end_offset_field": "end",
        "include_matches": true,
        "include_ids": true,
        "include_taxonomy_annotated": true,
        "processors": [
          {
            "class" : "Telephone"
          }
        ]
      }
    }
  ]
}
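
The option names and defaults in the table can also be captured in a small helper that assembles the pipeline body. The following Python sketch is illustrative only: the function build_siren_nlp_pipeline and the OPTION_DEFAULTS dictionary are hypothetical names, not part of the Siren NLP plugin, and the helper only builds the JSON body. It does not contact Elasticsearch.

```python
import json

# Option names and defaults copied from the table above. The helper itself
# (build_siren_nlp_pipeline) is hypothetical and not part of the plugin.
OPTION_DEFAULTS = {
    "target_field": "siren.nlp",
    "start_offset_field": "start",
    "end_offset_field": "end",
    "include_matches": True,
    "include_ids": True,
    "include_taxonomy_annotated": True,
}


def build_siren_nlp_pipeline(fields, processors=None, **options):
    """Return a body for PUT _ingest/pipeline/<name> as a dict."""
    if not fields:
        raise ValueError("'fields' is compulsory and must not be empty")
    unknown = set(options) - set(OPTION_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown siren-nlp options: {sorted(unknown)}")
    config = {"fields": list(fields)}
    # Only emit options that differ from the documented defaults.
    for name, value in options.items():
        if value != OPTION_DEFAULTS[name]:
            config[name] = value
    # Omitting 'processors' leaves the default processors in place.
    if processors is not None:
        config["processors"] = list(processors)
    return {"processors": [{"siren-nlp": config}]}


body = build_siren_nlp_pipeline(
    ["title", "snippet"],
    processors=[{"class": "Telephone"}],
    include_matches=False,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after a PUT _ingest/pipeline/<name> line on the Dev Tools page.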

Output

The new target_field that is created during NLP enrichment contains the following subfields:

instances: Holds instance objects for each NLP match that is extracted, categorized first by the source field, then by the annotation type. Each instance object contains the following fields:

  • match: The exact text of the annotated span.

  • start: The position (zero-indexed) of the first character of the match within the source field. The name of this field is specified by the start_offset_field setting in the pipeline and defaults to "start".

  • end: The position (zero-indexed) after the last character of the match within the source field. The name of this field is specified by the end_offset_field setting in the pipeline and defaults to "end".

  • type: The type value of the entity.

  • fromfield: The source field that the match was found in.

  • id: An identifier for the annotation, which is specific to each processor.

  • Additional fields may be included, which are specific to each processor.

Example of the instances subfield and its contents:

"instances" : {
  "snippet" : {
    "entity/person" : [
      {
        "nerType" : "Person",
        "probability" : 0.9501721598726716,
        "start" : 0,
        "match" : "Bill Gates",
        "end" : 10,
        "id" : "entity/person:bill gates",
        "type" : "entity/person",
        "fromfield" : "snippet"
      },
      ...
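
Because start is the zero-indexed position of the first character and end is the position after the last character, the pair behaves like a half-open slice. The following Python sketch checks this for the instance shown above. The function span_matches is a hypothetical helper, and it assumes that offsets count characters in the same way that Python string indices do, which might not hold for text that contains characters outside the Basic Multilingual Plane.

```python
def span_matches(source_text, instance, start_key="start", end_key="end"):
    """Return True if source_text[start:end] reproduces the instance's match."""
    start, end = instance[start_key], instance[end_key]
    return source_text[start:end] == instance["match"]


snippet = (
    "Bill Gates is best known as the founder of the "
    "multi-national technology company Microsoft"
)
instance = {
    "start": 0,
    "match": "Bill Gates",
    "end": 10,
    "id": "entity/person:bill gates",
    "type": "entity/person",
    "fromfield": "snippet",
}
print(span_matches(snippet, instance))
```

If start_offset_field or end_offset_field were changed in the pipeline, the same names would be passed as start_key and end_key.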

matches: Specifies exact matches from all fields that are analyzed, categorized by entity type:

"matches" : {
  "entity/person" : [
    "Bill Gates"
  ],
  "entity/organization" : [
    "Microsoft"
  ]
}

ids: All id values from all fields that are analyzed, categorized by entity type:

"ids" : {
  "entity/person" : [
    "entity/person:bill gates"
  ],
  "entity/organization" : [
    "entity/organization:microsoft"
  ]
}

taxonomy_annotated: A copy of each source field with annotated text. For more information, see Using Taxonomies.
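
The relationship between instances, matches, and ids can be illustrated by recomputing the two summary subfields from the instance objects. This Python sketch is a client-side approximation under stated assumptions: summarize_instances is a hypothetical helper, the sample instances are trimmed to the match and id fields, and the removal of duplicates is an assumption rather than documented plugin behavior.

```python
from collections import defaultdict


def summarize_instances(instances):
    """Group match and id values by annotation type across all source fields."""
    matches = defaultdict(list)
    ids = defaultdict(list)
    for by_type in instances.values():            # keyed by source field
        for entity_type, items in by_type.items():  # keyed by annotation type
            for item in items:
                # Assumption: repeated values are listed once.
                if item["match"] not in matches[entity_type]:
                    matches[entity_type].append(item["match"])
                if item["id"] not in ids[entity_type]:
                    ids[entity_type].append(item["id"])
    return dict(matches), dict(ids)


sample_instances = {
    "title": {
        "entity/person": [
            {"match": "Bill Gates", "id": "entity/person:bill gates"},
        ],
    },
    "snippet": {
        "entity/person": [
            {"match": "Bill Gates", "id": "entity/person:bill gates"},
        ],
        "entity/organization": [
            {"match": "Microsoft", "id": "entity/organization:microsoft"},
        ],
    },
}
matches, ids = summarize_instances(sample_instances)
print(matches)
print(ids)
```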

Siren NLP processors

The processors directive in the siren-nlp configuration is a list of processor object configurations, each with a class and, for some processors, a settings object:

{
  "class" : "Url",
  "settings":{
    "lenient": "true"
  }
}

The following processor classes are available:

Class | Default* | Settings
--- | --- | ---
Telephone | yes | -
USTelephone | yes | -
Email | yes | -
IPv4 | yes | -
IPv6 | yes | -
MacAddress | yes | -
Url | yes | lenient: If set to true, it will match in a more lenient fashion. For example, both "google.com" and "www.google.com" will provide matches. The default setting is false.
SortCode | yes | -
HashTag | yes | -
NER | yes (one for each type) | nerType: Location, Organization or Person (English name finder models from OpenNLP).
CustomRegex | no | pattern: The regex pattern to match on; required. caseInsensitive: Use a case-insensitive flag in the regex matching; the default value is true. type: A value that will be passed on to the type field in the output; required.
Taxonomy | no | For more information, see Using Taxonomies.

If you do not include a list of processors in the siren-nlp configuration, the default processors will be included.

The following table lists the output of each processor within each instance object:

Class | Output type | Output id | Example output instance
--- | --- | --- | ---
Telephone | entity/phonenumber | entity/phonenumber:[match lowercased] | {"start": 0, "match": "tel. 01229368123", "end": 16, "id": "entity/phonenumber:tel. 01229368123", "type": "entity/phonenumber", "fromfield": "title"}
USTelephone | entity/phonenumber | entity/phonenumber:[match lowercased] | {"start": 0, "match": "301-496-4000", "end": 12, "id": "entity/phonenumber:301-496-4000", "type": "entity/phonenumber", "fromfield": "title"}
Email | entity/email | entity/email:[match lowercased] | {"start": 0, "match": "email@example.com", "end": 17, "id": "entity/email:email@example.com", "type": "entity/email", "fromfield": "title"}
IPv4 | entity/ipaddress | entity/ipaddress:[match lowercased] | {"start": 0, "match": "172.16.254.1", "end": 12, "id": "entity/ipaddress:172.16.254.1", "type": "entity/ipaddress", "fromfield": "title"}
IPv6 | entity/ipaddress | entity/ipaddress:[match lowercased] | {"start": 0, "match": "0123:4567:89ab:cdef:0123:4567:89ab:cdef", "end": 39, "id": "entity/ipaddress:0123:4567:89ab:cdef:0123:4567:89ab:cdef", "type": "entity/ipaddress", "fromfield": "title"}
MacAddress | entity/macAddress | entity/macAddress:[match lowercased] | {"start": 0, "match": "00-D0-56-F2-B5-12", "end": 17, "id": "entity/macAddress:00-d0-56-f2-b5-12", "type": "entity/macAddress", "fromfield": "snippet"}
Url | entity/url | entity/url:[match lowercased] | {"start": 0, "match": "www.google.com", "end": 14, "id": "entity/url:www.google.com", "type": "entity/url", "fromfield": "title"}
SortCode | entity/financialAccount | entity/financialAccount:[match lowercased] | {"start": 0, "match": "11-24-76", "end": 8, "id": "entity/financialAccount:11-24-76", "type": "entity/financialAccount", "fromfield": "title"}
HashTag | entity/hashtag | entity/hashtag:[match lowercased] | {"start": 0, "match": "#photooftheday", "end": 14, "id": "entity/hashtag:#photooftheday", "type": "entity/hashtag", "fromfield": "title"}
NER | entity/person, entity/organization, entity/location | entity/organization:[match lowercased] etc. | {"nerType": "Organization", "probability": 0.8328421100140456, "start": 0, "match": "IBM", "end": 3, "id": "entity/organization:ibm", "type": "entity/organization", "fromfield": "title"}
CustomRegex | value given in type setting | [value in type setting]:[match lowercased] | {"start": 0, "match": "1984", "end": 4, "id": "year:1984", "type": "year", "fromfield": "title"}
Taxonomy | value given in type setting | ID of the matched taxonomy node | For more information, see Using Taxonomies.
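
To make the CustomRegex output convention concrete, the following Python sketch emulates how a pattern and a type value could turn into instance objects of the shape shown in the table. It is only an approximation: custom_regex_instances is a hypothetical function, the year pattern is an invented example, and Python's re module is used here, whereas the plugin's own regex engine might interpret some patterns differently.

```python
import re


def custom_regex_instances(text, fromfield, pattern, type_, case_insensitive=True):
    """Emulate CustomRegex-style instance objects for every match in text."""
    flags = re.IGNORECASE if case_insensitive else 0
    return [
        {
            "start": m.start(),                      # zero-indexed first character
            "match": m.group(0),
            "end": m.end(),                          # position after last character
            "id": f"{type_}:{m.group(0).lower()}",   # [type]:[match lowercased]
            "type": type_,
            "fromfield": fromfield,
        }
        for m in re.finditer(pattern, text, flags)
    ]


found = custom_regex_instances(
    "1984 was published in 1949",
    fromfield="title",
    pattern=r"\b(?:19|20)\d{2}\b",   # hypothetical four-digit year pattern
    type_="year",
)
for instance in found:
    print(instance)
```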

Using Taxonomies

Taxonomy indices

The Siren NLP Taxonomy processor can be used to match text in the source field against concepts and their synonyms, which are stored as a hierarchical classification. The Siren NLP plugin can read a taxonomy from an index before it is used in an indexing pipeline.

To use an index as a taxonomy in the Taxonomy processor, the index must have:

  • A field with a unique ID for each record;

  • A field listing the parent IDs, so that a hierarchy can be constructed connecting all of the records in the index; and

  • At least one field that contains a string or a list of strings (synonyms) to match against the source field.
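
Before pointing the processor at an index, it can be useful to check that the records really form a hierarchy. The following Python sketch shows one way to do that on records that have already been loaded into memory. The function check_taxonomy and the field names id and parents are assumptions for illustration, and the rule that every parent ID must exist in the index is a sanity check chosen here, not a documented requirement of the plugin.

```python
def check_taxonomy(records, id_field="id", parents_field="parents"):
    """Return a list of problems found in a list of taxonomy records."""
    problems = []
    seen = set()
    for record in records:
        node_id = record.get(id_field)
        if node_id is None:
            problems.append(f"record without {id_field!r}: {record}")
        elif node_id in seen:
            problems.append(f"duplicate id: {node_id}")
        else:
            seen.add(node_id)
    # Every parent reference should point at a known record.
    for record in records:
        for parent in record.get(parents_field, []):
            if parent not in seen:
                problems.append(f"{record.get(id_field)}: unknown parent {parent}")
    return problems


records = [
    {"id": "Car", "synonyms": ["car"], "parents": []},
    {"id": "Sports_car", "synonyms": ["sports car"], "parents": ["Car"]},
    {"id": "Alpine_A110", "synonyms": ["Alpine A110"], "parents": ["Sports_car"]},
]
print(check_taxonomy(records))
```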

Configuring the Taxonomy processor

The Taxonomy processor supports the following settings:

Taxonomy Setting | Required | Default | Description
--- | --- | --- | ---
index | yes | - | The name of the index that contains the taxonomy data.
idField | yes | - | A field name in the taxonomy index whose value is a unique identifier for the taxonomy node.
preferredTermField | yes | - | A field name in the taxonomy index whose value is a preferred term for the taxonomy node.
synonymFields | yes | - | A list of field names in the taxonomy index from which to collect synonyms for matching to the ingested document.
parentsField | yes | - | A field name in the taxonomy index whose value is a list of document IDs of the parent nodes of this one. This will be used to calculate the paths to each node by comparing with the values in the idField. Any root nodes (at the top of the hierarchy) would have an empty list in this field.
caseSensitive | no | false | If set to true, matches synonyms to the ingested text case-sensitively.
exactWhitespace | no | false | If set to true, matches multi-token taxonomy synonyms with single spaces between tokens to ingested text with any whitespace between tokens.
plurals | no | true | If set to true, matches plurals in ingested text with singular synonyms in the taxonomy.
additionalData | no | false | If set to true, data from matched taxonomy records (id, preferred_term, synonyms, and parents) will be added to the annotation object for a matched span of text.

The following example shows a typical taxonomy, stored in the index cars:

{
  "id" : "Renault_Alpine_GTA/A610",
  "preferred_term" : "Renault Alpine GTA/A610",
  "synonyms" : [
    "Renault Alpine GTA/A610"
  ],
  "parents" : [
    "Renault",
    "Sports_car"
  ]
}

The corresponding configuration for the Taxonomy processor might be as follows:

{
  "class": "Taxonomy",
  "settings": {
    "index": "cars",
    "idField": "id",
    "preferredTermField": "preferred_term",
    "synonymFields": ["synonyms"],
    "parentsField": "parents",
    "caseSensitive": false,
    "exactWhitespace": false,
    "plurals": true,
    "type": "taxonomy-cars",
    "additionalData": true
  }
}

When documents are indexed by using the Taxonomy processor, instance objects will be created in the target_field for each match to a synonym in each source field.

Output

If you have set the additionalData parameter to true, the following fields are included in the Taxonomy instance objects:

  • preferredTerm: The value in the preferredTermField.

  • synonyms: All synonyms that are collected from synonymFields.

  • parents: The value in the parentsField.

  • id_paths: A list of strings that represent paths to the matched taxonomy node from the root of the taxonomy. They are composed of node IDs, for example, ["|Car|Cars_by_Manufacturer|Volkswagen|Volkswagen_Golf"].

  • pt_paths: The same as id_paths, but each path is composed of preferred terms instead of IDs. This is useful for display if node IDs are obscure.

  • ancestors: All ancestor node IDs of this node, up to and including the root node.

The following is an example of the corresponding output:

{
  "id_paths" : [
    "|Car|Cars_by_Manufacturer|Ford|Ford_Focus",
    "|Car|Cars_by_Class|Compact_car|Ford_Focus"
  ],
  "synonyms" : [
    "Ford Focus"
  ],
  "preferredTerm" : "Ford Focus",
  "start" : 0,
  "match" : "Ford Focus",
  "pt_paths" : [
    "|Car|Cars by Manufacturer|Ford|Ford Focus",
    "|Car|Cars by Class|Compact car|Ford Focus"
  ],
  "end" : 10,
  "id" : "Ford_Focus",
  "type" : "taxonomy-cars",
  "fromfield" : "title",
  "ancestors" : [
    "Car",
    "Cars_by_Manufacturer",
    "Cars_by_Class",
    "Ford",
    "Compact_car"
  ],
  "parents" : [
    "Ford",
    "Compact_car"
  ]
}
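
The id_paths and ancestors values follow from the parents of each node. The Python sketch below shows one way that such values could be derived. It is not the plugin's implementation: the functions id_paths and ancestors are hypothetical, the parents_of mapping is inferred from the paths in the example output above, and the taxonomy is assumed to contain no cycles.

```python
def id_paths(node_id, parents_of, delimiter="|"):
    """Return every path from a root node down to node_id."""
    parents = parents_of.get(node_id, [])
    if not parents:                      # root node: empty parents list
        return [delimiter + node_id]
    return [
        path + delimiter + node_id
        for parent in parents
        for path in id_paths(parent, parents_of, delimiter)
    ]


def ancestors(node_id, parents_of):
    """Return all ancestor IDs of node_id, without duplicates."""
    found = []
    for parent in parents_of.get(node_id, []):
        for candidate in [*ancestors(parent, parents_of), parent]:
            if candidate not in found:
                found.append(candidate)
    return found


# Parent lists inferred from the example output; assumed for illustration.
parents_of = {
    "Car": [],
    "Cars_by_Manufacturer": ["Car"],
    "Cars_by_Class": ["Car"],
    "Ford": ["Cars_by_Manufacturer"],
    "Compact_car": ["Cars_by_Class"],
    "Ford_Focus": ["Ford", "Compact_car"],
}
print(id_paths("Ford_Focus", parents_of))
print(ancestors("Ford_Focus", parents_of))
```

The order in which this sketch lists ancestors may differ from the order that the plugin produces.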

Search features using Taxonomies

Creating an index capable of taxonomy path hierarchy search and annotated text search

You can run the following command on the Dev Tools page in Siren Investigate to create an index that supports the search features on taxonomy annotations, once data is ingested into it by using the siren-nlp Taxonomy processor:

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "siren_taxonomy_analyzer": {
          "tokenizer": "siren_taxonomy_tokenizer"
        }
      },
      "tokenizer": {
        "siren_taxonomy_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "|"
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "hierarchy": {
          "path_match": "*.instances.*.id_paths",
          "mapping": {
            "type": "text",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "siren_taxonomy_analyzer",
                "search_analyzer": "keyword",
                "fielddata": true
              }
            }
          }
        }
      },
      {
        "annotated_text": {
          "path_match": "*.taxonomy_annotated.*",
          "mapping": {
            "type": "annotated_text"
          }
        }
      }
    ]
  }
}

Path Hierarchy Search

Indices created with the index creation command above map the id_paths field of each taxonomy instance with a multifield, tree, that is tokenized by using the path_hierarchy tokenizer. This allows, for example, a record whose siren.nlp.instances.snippet.taxonomy-cars.id_paths field contains "|Car|Cars_by_Class|Sports_car|Alpine_A110" to be returned in a search for "|Car|Cars_by_Class|Sports_car", which effectively searches for mentions of any sports car.
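
The effect of the path_hierarchy tokenizer can be approximated in a few lines of Python. The sketch below is a simplified emulation, not the Elasticsearch implementation: path_hierarchy_tokens and subtree_query are hypothetical helpers, and the query body assumes the default target_field (siren.nlp) and the taxonomy-cars type used in the example above.

```python
import json


def path_hierarchy_tokens(path, delimiter="|"):
    """Approximate the tokens a path_hierarchy tokenizer emits for one path."""
    tokens = []
    position = path.find(delimiter, 1)   # skip a leading delimiter
    while position != -1:
        tokens.append(path[:position])   # every prefix ending before a delimiter
        position = path.find(delimiter, position + 1)
    if path:
        tokens.append(path)              # the full path is the last token
    return tokens


def subtree_query(field, path_prefix):
    """Build a term query against the tree subfield of an id_paths field."""
    return {"query": {"term": {f"{field}.tree": path_prefix}}}


indexed = path_hierarchy_tokens("|Car|Cars_by_Class|Sports_car|Alpine_A110")
print(indexed)
print("|Car|Cars_by_Class|Sports_car" in indexed)
print(json.dumps(
    subtree_query(
        "siren.nlp.instances.snippet.taxonomy-cars.id_paths",
        "|Car|Cars_by_Class|Sports_car",
    ),
    indent=2,
))
```

Because each prefix of the path is indexed as its own token, an exact term match on a prefix is enough to retrieve every record below that node.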

Annotated Text Search

If the siren-nlp ingest processor has been used with include_taxonomy_annotated set to true, a new field, taxonomy_annotated, is created in target_field. It contains a subfield for each source field, with Elastic annotated-text annotations for each ancestor term of its matching taxonomy node.

For example, if we have used a taxonomy with the structure:

Car
  >Cars_by_Manufacturer
    >Alpine
      >Alpine_A110
  >Cars_by_Class
    >Sports_car
      >Alpine_A110

the resulting taxonomy_annotated field might be:

"siren" : {
  "nlp" : {
    "taxonomy_annotated" : {
      "snippet" : "2017 [Alpine A110](Alpine_A110&Cars_by_Manufacturer&Car&Sports_car&Cars_by_Class&Alpine) for sale, 46000 miles, $44,000.",
      "title" : "[Alpine](Alpine&Cars_by_Manufacturer&Car) for sale"
    },
    ...

In the siren.nlp.taxonomy_annotated.snippet field, the text span Alpine A110 has been annotated with all of the ancestor terms of the Alpine_A110 node in the cars taxonomy, and the span Alpine in the siren.nlp.taxonomy_annotated.title field has been annotated with the ancestor terms of the Alpine node. Note that where multiple matches partially overlap, only the first match is annotated.

If the index creation command above was used to create the index, these subfields will have an annotated_text mapping, and the text will therefore be searchable by using the taxonomy node ID as well as plain text. Note that use of the annotated_text mapping requires the mapper-annotated-text plugin to be installed; see Installing the mapper-annotated-text plugin.
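
The annotated-text markup wraps each matched span in square brackets, followed by its annotation values in parentheses, separated by ampersands. The following Python sketch separates the plain text from the annotations for a value such as the snippet above. The function parse_annotated is a hypothetical helper. It assumes that annotation values are URL-encoded and that the source text contains no literal square brackets.

```python
import re
from urllib.parse import unquote_plus

# [span text](value1&value2&...)
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated(text):
    """Return (plain_text, [(span_text, [annotation values])])."""
    spans = [
        (m.group(1), [unquote_plus(value) for value in m.group(2).split("&")])
        for m in ANNOTATION.finditer(text)
    ]
    # Replace each annotated span with its bare text.
    plain = ANNOTATION.sub(lambda m: m.group(1), text)
    return plain, spans


annotated = (
    "2017 [Alpine A110](Alpine_A110&Cars_by_Manufacturer&Car&Sports_car"
    "&Cars_by_Class&Alpine) for sale, 46000 miles, $44,000."
)
plain, spans = parse_annotated(annotated)
print(plain)
print(spans)
```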

The following proximity query illustrates the power of annotated text search. It aims to find all sports cars for sale by searching the snippet_nlp.taxonomy_annotated.snippet field (this example assumes that target_field is set to snippet_nlp) for a mention of the word sale within 6 words of a span of text annotated with Sports_car:

{
  "query": {
    "span_near": {
      "slop": 6,
      "in_order": false,
      "clauses": [
        {
          "span_term": {
            "snippet_nlp.taxonomy_annotated.snippet": "sale"
          }
        },
        {
          "span_term": {
            "snippet_nlp.taxonomy_annotated.snippet": "Sports_car"
          }
        }
      ]
    }
  }
}

Installing the mapper-annotated-text plugin

To install the mapper-annotated-text plugin:

$ ./elasticsearch/bin/elasticsearch-plugin install mapper-annotated-text
-> Downloading mapper-annotated-text from elastic
[=================================================] 100%
-> Installed mapper-annotated-text