Source identifiers

In all versions of i2 Analyze, the ETL pipeline requires you to provide origin identifiers for incoming records through the ingestion mappings that you write. Starting from i2 Analyze 4.3.5, developers of connectors for i2 Connect and plug-ins for i2 Analyst's Notebook can explicitly provide source identifiers for the records that they create.

Note: For high-level descriptions of source identifiers and origin identifiers (and the differences between them), see Identifiers in i2 Analyze records.

When you attach recognizable identifiers to the records that you create in connector code, you enable services to perform operations based on those identifiers. You also enable i2 Analyze clients and servers to perform matching on records that have a shared source.

The rules that govern the structure and contents of source identifiers are similar to - but not the same as - the rules for origin identifiers. In keeping with their looser definition, the rules for source identifiers are sometimes less restrictive than those for origin identifiers.

The structure of a source identifier

A source identifier contains a type and a key. If a source identifier is attached to a record that is subsequently uploaded to the Information Store, then the identifier is stored with that record in the database. As a result, there are limits on the size of both the type and the key.

type

The type of a source identifier allows the services in an i2 Analyze deployment to determine whether the source identifier is "known" to them - that is, whether they can interpret its key.

The value of the type element does not have to be meaningful, but should be unique to your services so that you avoid clashes with any third-party services you might use.

The length of the source identifier's type must not exceed 200 bytes, which is equivalent to 100 2-byte Unicode characters.
The following types are reserved and must not be used (case-insensitive):
- OI.IS
- OI.DAOD
- OI.ANB
- Anything starting with i2.
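
The constraints on the type can be checked before a connector emits an identifier. The following Python sketch is illustrative only. The function name is hypothetical, it assumes that length is measured in 2-byte UTF-16 code units (as the 200-byte, 100-character equivalence suggests), and it takes the stricter reading of the last rule by treating any value that begins with i2 as reserved.

```python
RESERVED_TYPES = {"oi.is", "oi.daod", "oi.anb"}
MAX_TYPE_BYTES = 200


def is_valid_source_id_type(type_name: str) -> bool:
    """Return True if type_name satisfies the documented type constraints."""
    lowered = type_name.lower()  # reserved names are compared case-insensitively
    if lowered in RESERVED_TYPES or lowered.startswith("i2"):
        return False
    # Assumption: the type is stored as UTF-16, so most characters take 2 bytes.
    return len(type_name.encode("utf-16-le")) <= MAX_TYPE_BYTES
```

With these assumptions, a value such as NYPD passes the check, while OI.IS, oi.is, and I2.custom are all rejected.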

key

The key of a source identifier is an array containing the information necessary to reference the data in its source. The pieces of information that you use to make up the key differ depending on the source of the data.

The total length of the source identifier's key must not exceed 692 bytes, which is equivalent to 346 2-byte Unicode characters.
The key is stored as a serialized JSON array, so additional characters appear alongside the actual key elements: two characters are required for the array brackets, and two quotation marks are required for each element, with commas as separators between elements.

In other words, a key with N elements requires 3N + 1 characters of overhead. For example, the total length of ["a","bc","defg"] is 17 characters, while ["a,bc,defg"] is 13 characters. Also, if some special characters are present in a key element, they must be escaped for storage. For example, " becomes \", which further increases the size of the key.
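
The arithmetic above can be reproduced with a short Python sketch. It assumes that the stored form is equivalent to compact JSON (no spaces after separators, non-ASCII characters left unescaped) and that size is measured in 2-byte UTF-16 code units. Neither detail is confirmed here, and the function names are hypothetical.

```python
import json

MAX_KEY_BYTES = 692


def serialized_key(key_elements):
    """Serialize the key as a compact JSON array, as it is assumed to be stored."""
    return json.dumps(list(key_elements), separators=(",", ":"), ensure_ascii=False)


def key_fits(key_elements):
    """Return True if the serialized key is within the documented byte limit."""
    return len(serialized_key(key_elements).encode("utf-16-le")) <= MAX_KEY_BYTES


print(len(serialized_key(["a", "bc", "defg"])))  # 17
print(len(serialized_key(["a,bc,defg"])))        # 13
print(len(serialized_key(['a"b'])))              # 8: the quotation mark is escaped
```

Under these assumptions, a single-element key can hold at most 342 characters of data, because four characters are spent on the brackets and quotation marks.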

Using source identifiers

When a service receives a seed record, it can inspect the source identifiers to find out whether a known type is present. If it is, then the key can be used to retrieve or match information from the source system.

When records are returned from external sources, the client can match them against other records by using the source identifiers (if source identifier matching is enabled in the match rules with enableSourceIdentifierMatching="true"). Also, if source identifier matching is configured in the system match rules, then source identifiers provided through i2 Connect or Analyst's Notebook can be matched against the origin identifiers of records in the Information Store.

In both cases, a source identifier match occurs when the item types of the records match and the types and keys of the identifiers match exactly (the comparison is case-sensitive).
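
The matching rule can be expressed as a small predicate. This Python sketch is not the product's implementation. It only restates the rule, with source identifiers represented as dictionaries that have type and key entries, and with hypothetical item type identifiers such as ET1.

```python
def identifiers_match(item_type_a, source_id_a, item_type_b, source_id_b):
    """Apply the documented rule: same item type, identical type and key."""
    return (
        item_type_a == item_type_b
        # String comparison in Python is case-sensitive, as the rule requires.
        and source_id_a["type"] == source_id_b["type"]
        # Keys are compared element by element, in order.
        and list(source_id_a["key"]) == list(source_id_b["key"])
    )


first = {"type": "NYPD", "key": ["FELONY", "75", "123456789"]}
second = {"type": "NYPD", "key": ["felony", "75", "123456789"]}

print(identifiers_match("ET1", first, "ET1", dict(first)))  # True
print(identifiers_match("ET1", first, "ET1", second))       # False: case differs
print(identifiers_match("ET1", first, "ET2", dict(first)))  # False: item types differ
```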

Note: After upgrading to i2 Analyze 4.3.5 from a previous release, if you want to match against the origin identifiers of already ingested records, you must build a new match index.

Limitations

There is a limit on the number of unique source identifiers that you can add to a record in the Information Store. The default limit is 50, but you can modify it by adding MaxSourceIdentifiersPerRecord=N (where N is a positive integer) to the DiscoServerSettingsCommon.properties file.

Origin identifiers are not stored in chart records. As a result, they are not present as source identifiers on seeds that are sent to connectors or Analyst's Notebook plug-ins. The only source identifiers present on those seeds are the ones that connectors or plug-ins added to the records.

Connectors that specify source identifiers are API-compatible with older versions of the i2 Connect gateway and Analyst's Notebook, but those products will not recognize new source identifiers as such. Rather, they will be treated as ordinary identifiers or ignored. To use source identifiers to their full potential, upgrade all products to the latest release at your earliest opportunity.

Example

This example is based on the NYPD connector sample solution. When the external data source is queried, the connector builds a source identifier for each record and assigns it as the record's identifier, in place of the unique identifier from the dataset. This arrangement allows the find-like-this-complaint service to extract the relevant information and use it in the query parameters.

Build a source identifier

To build a source identifier, you must provide the values for its type and key fields:

- For the type, you need a unique value that you can use to identify your source identifiers. You can use the value to filter out any other source identifiers that are also attached to the seeds, such as system-generated source identifiers.
- For the key, populate an array with information that can be used to build the query in the service.

Then, construct the source identifier from the type and the key, and assign it to the entity's identifier property.

In the following Java and Python examples, the type is assigned the value NYPD. The key is built using the values of the law_cat_cd, addr_pct_cd and cmplnt_num columns of the NYPD Complaint Data dataset.

Java example

In the ItemFactory.java file, define a method to build the source identifiers:

private Object generateSourceIdentifier(SocrataResponseData entry) {
  if (entry.offenceLevel != null && entry.precinctCode != 0) {
    final SourceIdentifier sourceId = new SourceIdentifier();
    sourceId.type = "NYPD";
    sourceId.key = Arrays.asList(entry.offenceLevel, String.valueOf(entry.precinctCode), entry.complaintNum);

    return new EntityDataIdentifierSourceId(sourceId);
  } else {
    return "COMP" + entry.complaintNum;
  }
}

This code uses the SourceIdentifier and EntityDataIdentifierSourceId classes from the supplied Java implementation of the i2 Connect Gateway REST SPI. You'll need to import them at the top of the file:

import com.i2group.connector.spi.rest.transport.SourceIdentifier;
import com.i2group.connector.spi.rest.transport.EntityDataIdentifierSourceId;

Then, in the createComplaint() method, set the complaint.id value to call the new method:

complaint.id = generateSourceIdentifier(entry);

Python example

In the classes.py file, define a global function to build the source identifiers:

def generate_source_identifier(entry):
    source_identifier = SourceIdentifier(
        type="NYPD",
        key=[entry.get('law_cat_cd'), entry.get('addr_pct_cd', ''), entry.get('cmplnt_num')]
    )
    return EntityDataIdentifierSourceId(source_identifier)

This code uses the SourceIdentifier and EntityDataIdentifierSourceId classes from the supplied Python implementation of the i2 Connect Gateway REST SPI. You'll need to import them at the top of the file:

from spi.models.source_identifier import SourceIdentifier
from spi.models.entity_data_identifier_source_id import EntityDataIdentifierSourceId

Then, on the id property in the Complaint class, replace the call to the method get_id(base, entry) with generate_source_identifier(entry):

class Complaint(Entity):
    """
    A base class for Complaint entities.

    Attributes:
        entry (dict): One record from the external data source.
    """
    def __init__(self, entry):
        id = generate_source_identifier(entry)
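
The Java example falls back to a plain string identifier when the offense level or precinct code is missing, but the Python function above does not. A Python equivalent might look like the following sketch. To keep it self-contained, it defines simplified stand-ins for the SourceIdentifier and EntityDataIdentifierSourceId classes. In the connector you would import the real SPI classes instead, and their constructor arguments might differ.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SourceIdentifier:
    """Stand-in for spi.models.source_identifier.SourceIdentifier."""
    type: str
    key: List[str]


@dataclass
class EntityDataIdentifierSourceId:
    """Stand-in for the SPI wrapper class of the same name."""
    source_id: SourceIdentifier


def generate_source_identifier(entry: dict) -> Union[EntityDataIdentifierSourceId, str]:
    offence_level = entry.get("law_cat_cd")
    precinct_code = entry.get("addr_pct_cd")
    complaint_num = str(entry.get("cmplnt_num"))

    if offence_level and precinct_code:
        key = [offence_level, str(precinct_code), complaint_num]
        return EntityDataIdentifierSourceId(SourceIdentifier(type="NYPD", key=key))

    # Without both values the service cannot build its query,
    # so fall back to a plain string identifier as the Java example does.
    return "COMP" + complaint_num
```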

Update the find-like-this-complaint service

To use the source identifiers that are now attached to the seeds, you need to extract and filter the sourceIds to find the ones with the type that you created. You can use the key of those source identifiers in your query parameters to the service.

In the following Java and Python examples, the source identifiers are filtered down to those whose type property contains the value NYPD. Their key values are then used to populate the request parameters. Instead of the existing parameters, which searched for matching records with the same level of offense, the request is updated to search for matching records with the same level of offense within the same precinct.

Java example

In the ExternalConnectorDataService.java file, replace the findLikeThisComplaint() method with the following:

public I2ConnectData findLikeThisComplaint(DaodSeeds seeds) {
  final DaodSeedEntityData seed = seeds.entities.get(0);

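  // Seeds that did not come from this connector might carry no source identifiers
  final List<SeedSourceIdentifier> sourceIds = seed.sourceIds == null
      ? Collections.<SeedSourceIdentifier>emptyList()
      : seed.sourceIds;

  // Keep only the source identifiers that this connector created
  final List<SeedSourceIdentifier> nypdSourceIds = sourceIds
      .stream()
      .filter(sourceId -> "NYPD".equals(sourceId.type))
      .collect(Collectors.toList());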

  if (nypdSourceIds.isEmpty()) {
    return new I2ConnectData();
  }

  final Map<String, Object> params = new HashMap<>();
  params.put("limitValue", 50);
  params.put("lawCategory", nypdSourceIds.get(0).key.get(0)); // Level of offence
  params.put("precinctCode", nypdSourceIds.get(0).key.get(1)); // Precinct code

  final String url = LIMIT_PARAM + "&$where=law_cat_cd='{lawCategory}'&addr_pct_cd='{precinctCode}'";
  final SocrataResponse response = socrataClient.get(url, SocrataResponse.class, params);

  final I2ConnectData connectorResponse = new I2ConnectData();
  connectorResponse.entities = response.stream().map(itemFactory::createComplaint).collect(Collectors.toList());
  connectorResponse.links = Collections.emptyList();
  return connectorResponse;
}

Python example

In the service.py file, update the if seeds: block of code in the impl_find_like_this_complaint() method with the following:

    if seeds:
        # Seeds that did not come from this connector might carry no source ids
        source_ids = seeds['entities'][0].get('sourceIds', [])

        # Keep only the source ids that this connector created
        nypd_source_ids = [
            source_id for source_id in source_ids
            if source_id['type'] == 'NYPD'
        ]

        if not nypd_source_ids:
            return response

        params = f"&$where=law_cat_cd='{nypd_source_ids[0]['key'][0]}'&addr_pct_cd='{nypd_source_ids[0]['key'][1]}'"

        records = query_external_datasource(params)
        response = marshal(records, type_ids['complaint'], False)
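
Because seeds can arrive with no source identifiers at all (see Limitations), it can help to isolate the extraction step in a function that tolerates missing data. The helper below is a hypothetical sketch and not part of the sample solution. It assumes that the seeds payload is a dictionary with the same shape that the code above indexes into.

```python
def first_nypd_key(seeds, expected_length=3):
    """Return the key of the first usable NYPD source identifier, or None."""
    entities = (seeds or {}).get("entities") or []
    if not entities:
        return None

    for source_id in entities[0].get("sourceIds") or []:
        key = source_id.get("key") or []
        # Skip identifiers from other sources and keys that are too short to index.
        if source_id.get("type") == "NYPD" and len(key) >= expected_length:
            return key
    return None


seeds = {
    "entities": [
        {
            "sourceIds": [
                {"type": "SYSTEM", "key": ["abc"]},
                {"type": "NYPD", "key": ["FELONY", "75", "123456789"]},
            ]
        }
    ]
}
print(first_nypd_key(seeds))  # ['FELONY', '75', '123456789']
```

A service could call this helper first and return an empty response when it yields None, which keeps the query-building code free of index checks.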