Matching algorithms in IBM Match 360
IBM Match 360 with Watson uses matching algorithms to resolve data records into master data entities. Data engineers can define different matching algorithms for each entity type in their data. The matching algorithms can then analyze the data to evaluate and compare records, and then collect matched records into entities.
There are two common reasons to run matching on your data:
- For record deduplication and entity resolution, the matching process analyzes your data to determine whether any duplicate records exist in your data. Suspected duplicate records are merged into master data entities to establish a single, trusted, 360-degree view of your data.
- To create other types of entity associations, the matching process analyzes your data to collect records into entities that represent different kinds of groupings, such as a household.
In this topic:
- Matching to create more than one type of entity
- The matching process
- Components of the matching algorithm
Matching to create more than one type of entity
IBM Match 360 matching algorithms are driven by the entity type of the associated data. You can define more than one entity type for each record type in the data model. For each entity type, configure and tune its corresponding matching algorithm to ensure that IBM Match 360 creates entities that meet your organization's requirements.
A single record can be part of more than one separate entity. If your data model includes more than one entity type, you can run different types of matching across the same data set. For example, consider a data set that includes person records from across your enterprise. If the Person record type includes definitions for a Person entity type and a Household entity type, then you can run the Person matching algorithm for entity resolution and deduplication, and also run the Household matching algorithm to create entities made up of person records that belong to the same household.
The matching process
The matching engine goes through a defined process to match records into entities. The matching process includes three major steps:
- Standardization. During this step, the algorithm standardizes the format of the data so that it can be processed by the matching engine.
- Bucketing. The algorithm sorts data into various categories or "buckets" so that it can compare like-to-like pieces of information.
- Comparison. The algorithm compares data to determine a final comparison score. The algorithm then uses the comparison score to determine whether the records are a match.
Each of these steps is defined and configured by the matching algorithm.
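The three steps can be pictured with a small, self-contained sketch. This is a toy illustration only, not the IBM Match 360 engine: the function names, the bucket key, and the token-overlap score are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def standardize(name: str) -> list:
    """Standardization: uppercase the value and split it into tokens."""
    return name.upper().split()


def bucket_key(tokens: list) -> str:
    """Bucketing: a toy key built from the sorted first letters of the tokens."""
    return "".join(sorted(t[0] for t in tokens))


def compare(a: list, b: list) -> float:
    """Comparison: toy score, the share of distinct tokens that both records have."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / max(len(union), 1)


def candidate_pairs(records: dict) -> list:
    """Compare only records that landed in the same bucket."""
    standardized = {rid: standardize(value) for rid, value in records.items()}
    buckets = defaultdict(list)
    for rid, tokens in standardized.items():
        buckets[bucket_key(tokens)].append(rid)
    pairs = []
    for ids in buckets.values():
        for left, right in combinations(sorted(ids), 2):
            pairs.append((left, right, compare(standardized[left], standardized[right])))
    return pairs


records = {"r1": "John Smith", "r2": "smith  JOHN", "r3": "Maria Lopez"}
print(candidate_pairs(records))  # [('r1', 'r2', 1.0)]
```

Because bucketing limits comparison to records that share a bucket, r3 is never compared with r1 or r2 in this sketch.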
Components of the matching algorithm
Three main types of components define an IBM Match 360 matching algorithm:
Standardizers
As the name suggests, standardizers define how data gets standardized. Standardization enables the matching algorithm to convert the values of different attributes to a standardized representation that can be processed by the matching engine.
The matching algorithm uses multiple standardizers. Each standardizer is suited to process specific attribute types found in record data.
Standardizers are defined by JSON objects. Each standardizer's JSON object definition contains three elements:
- label: A label that identifies this standardizer.
- inputs: The inputs list has one element, which is a JSON object. That JSON object has two elements, fields and attributes:
  - fields: The list of fields to use for standardization.
  - attributes: The list of attributes to use for standardization.
- standardizer_recipe: A list of JSON objects in which each object represents one step to be run during the standardization process of the associated standardizer. Each object in the standardizer_recipe list consists of the following main elements:
  - label: A label that identifies this step in the standardizer recipe.
  - method: The internal method used. This element is for reference only and must not be edited.
  - inputs: A single element of the inputs list defined one level higher.
  - fields: A list of the fields to be used for this step. This is generally a subset of all the fields defined within the inputs list one level higher. Not every step needs to process all of the inputs fields.
  - set_resource: The name of a set type customizable resource used for this step.
  - map_resource: The name of a map type customizable resource used for this step.
Depending on the behavior of a step, there might be more configuration elements that are required in the corresponding JSON object.
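As a rough illustration of the structure described above, the following sketch builds a standardizer definition as a Python dictionary and checks it against the documented shape. Every value in it (labels, field names, the method name, and the 1-based index used for the step-level inputs) is a hypothetical placeholder, not taken from a generated IBM Match 360 algorithm.

```python
# Hypothetical standardizer definition; all values are illustrative placeholders.
standardizer = {
    "label": "Example name standardizer",
    "inputs": [
        {"fields": ["given_name", "last_name"], "attributes": ["legal_name"]}
    ],
    "standardizer_recipe": [
        {
            "label": "Example uppercase step",
            "method": "ExampleUpperCaseMethod",  # placeholder, not a real method name
            "inputs": [1],  # assumption: refers to the first element of the outer inputs list
            "fields": ["given_name", "last_name"],
        }
    ],
}


def check_standardizer(definition: dict) -> list:
    """Return a list of structural problems; an empty list means the shape looks right."""
    problems = [
        f"missing element: {key}"
        for key in ("label", "inputs", "standardizer_recipe")
        if key not in definition
    ]
    if problems:
        return problems
    if len(definition["inputs"]) != 1:
        return ["inputs must contain exactly one object"]
    declared = set(definition["inputs"][0].get("fields", []))
    for step in definition["standardizer_recipe"]:
        extra = set(step.get("fields", [])) - declared
        if extra:
            problems.append(
                f"step {step.get('label')!r} uses undeclared fields: {sorted(extra)}"
            )
    return problems


print(check_standardizer(standardizer))  # []
```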
Preconfigured standardizers
The following standardizers are ready to use in IBM Match 360. The preconfigured standardizers are also customizable.
Person Name standardizer
This standardizer is used to standardize Person Name attribute values. It contains the following recipes, in sequence:
- Upper case: Converts the input field values to use their uppercase equivalents.
- Map character: Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
- Tokenizer: Tokenizes the input field value into multiple tokens, based on the defined list of delimiters.
- Parse token: Parses the input field values to different tokens, depending on the predefined values in the IBM Match 360 resources. For example, you can use this recipe to parse suffix, prefix, and generation values into appropriate fields.
- Length: Discards tokens that are outside a given length range. Minimum and maximum values are defined in the IBM Match 360 resources.
- Stop token: Removes anonymous input values, as configured.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Person Name standardizer uses the following Map resources by default:
- map_character_general: Converts UNICODE input characters to equivalent English alphabet characters.
- person_map_name_alignments: Parses suffix, prefix, and generation values into appropriate fields.
The Person Name standardizer uses the following Set resources by default:
- person_set_name_aname: Removes anonymous person name values.
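To make the recipe sequence concrete, here is a minimal sketch of four of the steps (Upper case, Tokenizer, Length, and Stop token) applied in order. The delimiter pattern, the length limits, and the anonymous-name set are assumptions standing in for the configurable resources; the Map character, Parse token, and Pick token steps are left out.

```python
import re

# Stand-ins for configurable resources (assumed values, not product defaults).
ANONYMOUS_NAMES = {"UNKNOWN", "TEST", "BABY"}
DELIMITERS = r"[ ,.\-]+"


def standardize_person_name(value: str, min_len: int = 2, max_len: int = 30) -> list:
    upper = value.upper()                                          # Upper case
    tokens = [t for t in re.split(DELIMITERS, upper) if t]         # Tokenizer
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]   # Length
    return [t for t in tokens if t not in ANONYMOUS_NAMES]         # Stop token


print(standardize_person_name("Mary-Ann O. unknown Smith"))  # ['MARY', 'ANN', 'SMITH']
```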
Organization Name standardizer
This standardizer is used to standardize Organization Name attribute values. It contains the following recipes, in sequence:
- Upper case: Converts the input field values to use their uppercase equivalents.
- Map character: Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
- Stop character: Removes unwanted input characters from name values.
- Map token: Generates nicknames or alternate names for the given input and stores the information in a separate new internal field.
- Tokenizer: Tokenizes the input field value into multiple tokens, based on the defined list of delimiters.
- Stop token: Removes anonymous input values, as configured.
- Acronym: Generates an acronym for the given organization name and stores the information in a separate new internal field. This acronym value is used during comparison to handle abbreviated names.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Organization Name standardizer uses the following Map resources by default:
- map_character_general: Converts UNICODE input characters to equivalent English alphabet characters.
- org_map_name_cnick_name: Generates nicknames or alternate names for the given input.
The Organization Name standardizer uses the following Set resources by default:
- org_set_name_aname: Removes anonymous organization name values.
Date standardizer
This standardizer is used to standardize Date attribute values. It supports many different date formats and contains the following recipes, in sequence:
- Map character: Converts slash characters (/) to dash characters (-).
- Date function: Converts date inputs in different formats to a standardized format.
- Stop token: Removes anonymous date values, as configured.
- Parse token: Parses the input field values to different tokens, depending on certain regular expressions. For example, you can use this recipe to parse a full date input into day, month, and year tokens.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Date standardizer uses the following Map resources by default:
- map_character_date_separators: Converts slash (/) or any other separator characters to dash characters (-).
- map_date_tokens_year_month_day: Parses the input date value to internal fields, namely birth_year, birth_month, and birth_day, based on regular expressions.
The Date standardizer uses the following Set resources by default:
- set_date_date: Removes anonymous date values.
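The Date recipes can be sketched in a few lines. This example assumes year-first input and skips the Date function step that converts other formats; the anonymous-date set and the regular expression are illustrative stand-ins for the set and map resources.

```python
import re
from typing import Dict, Optional

# Stand-ins for configurable resources (assumed values).
ANONYMOUS_DATES = {"1900-01-01", "9999-12-31"}
DATE_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def standardize_date(value: str) -> Optional[Dict[str, str]]:
    normalized = value.strip().replace("/", "-")   # Map character: slash to dash
    if normalized in ANONYMOUS_DATES:              # Stop token: drop anonymous dates
        return None
    match = DATE_PATTERN.match(normalized)         # Parse token: split into parts
    if match is None:
        return None
    year, month, day = match.groups()
    return {"birth_year": year, "birth_month": month, "birth_day": day}


print(standardize_date("1985/07/04"))
# {'birth_year': '1985', 'birth_month': '07', 'birth_day': '04'}
```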
Gender standardizer
This standardizer is used to standardize Gender attribute values. It contains the following recipes, in sequence:
- Map character: Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
- Upper case: Converts the input field values to use their uppercase equivalents.
- Stop token: Removes anonymous input gender values, as configured.
- Map token: Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
- Parse token: Parses processed field values to an appropriate internal field.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Gender standardizer uses the following Map resources by default:
- map_character_general: Converts UNICODE input characters to equivalent English alphabet characters.
- map_gender_gender: Maps different input gender values to standard values.
- map_gender_tokens_gender: Parses the input token value to the internal gender field based on regular expression.
The Gender standardizer uses the following Set resources by default:
- set_gender_anon_gender: Removes anonymous input gender values.
Address standardizer
This standardizer is used to standardize Address attribute values. Addresses can have several different formats, depending on the locales. This flexibility requires complex processing to convert addresses to a standardized form. The Address standardizer contains the following recipes, in sequence:
- Upper case: Converts the input field values to use their uppercase equivalents.
- Map character: Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
- Map token: Converts input token values to equivalent values, as configured in the IBM Match 360 resources. For example, "United States of America", "United States", and "US" can all be mapped to "USA". This mapping is common for country and province/state field values. In addition, delimiter characters configured in the resource are mapped to the space character.
- Tokenizer: Tokenizes the input field value into multiple tokens, based on the defined list of delimiters.
- Stop token: Removes anonymous input values, such as postal codes, as configured.
- Keep token: Allows only the defined list of values for a given field. For example, you might define a list of postal codes that are allowed during standardization. Input values that are not in the allowed list will be removed.
- Parse token: Parses the input field values to appropriate internal fields depending on certain regular expressions and predefined values, as configured in the resources. You can use this recipe to truncate a given token to a certain length by using regular expressions. You can also define different alphanumeric pattern sets in the form of regular expressions to allow only certain patterns.
- Join fields: Joins two or more fields together to create a new combined value, assigned to an internal field. For example, latitude and longitude field values can be joined together to form a new internal field called lat_long.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Address standardizer uses the following Map resources by default:
- map_character_general: Converts UNICODE input characters to equivalent English alphabet characters.
- map_address_country: Converts input country values to equivalent values.
- map_address_province_state: Converts input province and state values to equivalent values.
- map_address_delimiter_removal: Maps delimiter characters configured in the resource to the space character.
- map_address_addr_tok: Converts input address token values to equivalent values.
- map_address_tokens_unit_type_and_number: Parses the input field residence_number based on regular expression to internal fields, namely unit_type and unit_number.
- map_address_tokens_street_number_name_direction_type: Parses the input field address_line1 based on regular expression to internal fields, namely street_number, street_name, direction, and street_type.
- map_address_tokens_sub_division: Parses the input field address_line2 based on regular expression to the internal field sub_division.
- map_address_tokens_pobox_type_and_number: Parses the input field address_line3 based on regular expression to internal fields, namely pobox_type and pobox.
- map_address_tokens_city: Parses the input value of the city field based on regular expression.
- map_address_tokens_province: Parses the input value of the province_state field based on regular expression to the internal field province.
- map_address_tokens_postal_code: Parses the input value of the field zip_postal_code based on regular expression to the internal field postal_code.
- map_address_tokens_country: Parses the input value of the field country based on regular expression.
- map_address_tokens_latitude: Parses the input value of the field latitude_degrees based on regular expression to the internal field latitude.
- map_address_tokens_longtitude: Parses the input value of the field longitude_degrees based on regular expression to the internal field longitude.
The Address standardizer uses the following Set resources by default:
- set_address_postal_code: Removes anonymous input values for zip_postal_code.
Phone standardizer
This standardizer is used to standardize Phone attribute values. It contains the following recipes, in sequence:
- Stop character: Removes unwanted input characters from phone values.
- Stop token: Removes anonymous phone values, as configured.
- Phone: Parses input phone numbers with different formats from different locales into a common format. This recipe can be configured to remove area codes and country codes from phone numbers. It can also retain a certain number of digits in a standardized phone number.
- Parse token: Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Phone standardizer uses the following Map resources by default:
- map_phone_tokens_phone: Parses phone values to an internal field based on regular expressions.
The Phone standardizer uses the following Set resources by default:
- set_character_phone: Replaces all characters that are not alphanumeric. Enables you to specify regular expressions.
- set_phone_anon_phone: Removes anonymous phone values.
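A simplified view of the Phone recipes, assuming that "retain a certain number of digits" means keeping the trailing digits of the number. The anonymous-value set and the default of ten digits are hypothetical choices for this sketch, and locale-aware parsing is not attempted.

```python
import re
from typing import Optional

# Stand-in for a set resource of anonymous phone values (assumed values).
ANONYMOUS_PHONES = {"0000000000", "9999999999"}


def standardize_phone(value: str, keep_digits: int = 10) -> Optional[str]:
    digits = re.sub(r"\D", "", value)   # Stop character: keep digits only
    if not digits or digits in ANONYMOUS_PHONES:   # Stop token: drop anonymous values
        return None
    return digits[-keep_digits:]        # Phone: retain the trailing digits


print(standardize_phone("+1 (555) 010-4477"))  # 5550104477
print(standardize_phone("000-000-0000"))       # None
```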
Identification standardizer
This standardizer is used to standardize Identification attribute values. It contains the following recipes, in sequence:
- Map character: Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
- Upper case: Converts the input field values to use their uppercase equivalents.
- Stop character: Removes unwanted input characters from identification values.
- Stop token: Removes anonymous input values, as configured.
- Map token: Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
- Parse token: Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Identification standardizer uses the following Map resources by default:
- map_character_general: Converts UNICODE input characters to equivalent English alphabet characters.
- map_identifier_equi_identifier: Converts input token values to equivalent values.
- map_identifier_tokens_identification_number: Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
The Identification standardizer uses the following Set resources by default:
- set_character_identification_number: Removes non-alphanumeric input characters from identification values. Enables you to specify regular expressions.
- set_identifier_anonymous: Removes anonymous identification values.
Email standardizer
This standardizer is used to standardize Email attribute values. It contains the following recipes, in sequence:
- Map character: Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
- Upper case: Converts the input field values to use their uppercase equivalents.
- Stop token: Removes anonymous input values, as configured.
- Map token: Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
- Parse token: Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
- Pick token: Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
The Email standardizer uses the following Map resources by default:
- map_character_general: Converts UNICODE input characters to equivalent English alphabet characters.
- map_non_phone_equi_non_phone: Converts input token values to equivalent values.
- map_non_phone_tokens_non_phone: Parses the input field email_id based on regular expression to the internal fields email_local_part and email_domain.
The Email standardizer uses the following Set resources by default:
- set_non_phone_anon_non_phone: Removes anonymous email values.
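The split of email_id into email_local_part and email_domain can be sketched with a single regular expression. The pattern below is a deliberately loose assumption for illustration and is not the expression shipped in the map resource.

```python
import re
from typing import Dict, Optional

# Loose illustrative pattern: one "@" with non-empty text on both sides.
EMAIL_PATTERN = re.compile(r"^([^@\s]+)@([^@\s]+)$")


def standardize_email(value: str) -> Optional[Dict[str, str]]:
    upper = value.strip().upper()          # Upper case
    match = EMAIL_PATTERN.match(upper)     # Parse token
    if match is None:
        return None
    return {"email_local_part": match.group(1), "email_domain": match.group(2)}


print(standardize_email(" Jane.Doe@Example.com "))
# {'email_local_part': 'JANE.DOE', 'email_domain': 'EXAMPLE.COM'}
```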
Entity types (bucketing)
Within a single matching algorithm, each record type can have multiple entity type definitions (entity_type JSON objects). For example, in an algorithm defined for a person record type, you might need to create more than one entity type definition, such as person entity, household entity, location entity, and others.
Each entity type can be used to match and link records in different ways. An entity type defines how records are bucketed and compared during the matching process.
Each entity type definition (entity_type) in the matching algorithm has several JSON elements:
- clerical_review_threshold: Records that have a comparison score lower than the clerical review threshold are considered as non-matches.
- auto_link_threshold: Records that have a comparison score higher than the autolink threshold are considered to be strong enough matches that they are automatically matched.
- bucket_generators: This section contains the definition of the bucket generators configured for an entity type. There are two types of bucket generators: buckets and bucket groups.
  - Buckets involve bucketing for only one attribute. Each bucket definition includes four elements:
    - label: A label that identifies the bucket generator.
    - maximum_bucket_size: A value that defines the size of large buckets. Any bucket hash with a bucket size greater than this value is not considered for candidate selection during matching.
    - inputs: For buckets, the inputs list has only one element, which is a JSON object. That JSON object has two elements, fields and attributes:
      - fields: The list of fields to use for bucketing.
      - attributes: The list of attributes to use for bucketing.
    - bucket_recipe: A bucket recipe list defines the steps for the bucket generator to complete during the bucketing process. Each bucket_recipe list has a number of subelements:
      - label: A label that identifies the bucket recipe element.
      - method: The internal method used. This element is just for reference and must not be edited.
      - inputs: A single element of the inputs list defined one level higher.
      - fields: A list of the fields to be used for this bucket. This is generally a subset of all the fields defined within the inputs list one level higher.
      - min_tokens: The minimum number of tokens to use when the recipe is forming a bucket hash.
      - max_tokens: The maximum number of tokens to use together when the recipe is forming a bucket hash.
      - count: A limit on the number of bucket hashes for a single record that get generated out of a bucket generator. If a record generates a lot of bucket hashes, only the number of hashes set by this element get picked up.
      - bucket_group: The sequence number for a bucket group that produces a bucket hash. Intermediary steps or recipes would not be assigned a sequence number.
      - order: Specifies whether the tokens are sorted in lexicographical order when multiple tokens are combined to form a bucket hash.
      - maximum_bucket_size: A value that defines the size of large buckets. This element is the same as the one defined at the bucket generator level; also having it at the bucket recipe level gives you finer control over large individual buckets.
  - Bucket groups involve bucketing for more than one attribute. Each bucket_group definition includes five elements:
    - label: A label that identifies the bucket generator.
    - maximum_bucket_size: A value that defines the size of large buckets. Any bucket hash with a bucket size greater than this value is not considered for candidate selection during matching.
    - inputs: For bucket groups, the inputs list has more than one JSON object element. The JSON objects each have two elements, fields and attributes:
      - fields: The list of fields to use for bucketing.
      - attributes: The list of attributes to use for bucketing.
    - bucket_recipe: A bucket recipe list defines the steps for the bucket generator to complete during the bucketing process. Each bucket_recipe list has a number of subelements:
      - label: A label that identifies the bucket recipe element.
      - method: The internal method used. This element is just for reference and must not be edited.
      - inputs: A single element of the inputs list defined one level higher.
      - fields: A list of the fields to be used for this bucket. This is generally a subset of all the fields that are defined within the inputs list one level higher.
      - min_tokens: The minimum number of tokens to use when the recipe is forming a bucket hash.
      - max_tokens: The maximum number of tokens to use together when the recipe is forming a bucket hash.
      - count: A limit on the number of bucket hashes for a single record that get generated out of a bucket generator. If a record generates many bucket hashes, only the number of hashes set by this element get picked up.
      - bucket_group: The sequence number for a bucket group that produces a bucket hash. Intermediary steps or recipes would not be assigned a sequence number.
      - order: Specifies whether the tokens are sorted in lexicographical order when multiple tokens are combined to form a bucket hash.
      - maximum_bucket_size: A value that defines the size of large buckets. This element is the same as the one defined at the bucket generator level. Being able to define it at the bucket recipe level gives you finer control over large individual buckets.
      - set_resource: The name of a set type resource used for a bucket recipe.
      - map_resource: The name of a map type resource used for a bucket recipe.
      - output_fields: If this recipe produces new fields after it completes bucketing functions on the input fields, this element contains a list of the names of the generated fields.
    - bucket_group_recipe: A bucket group recipe section is typically used for defining buckets that consist of more than one attribute. Every element of a bucket_group_recipe list is a JSON object defining the construct for a single bucket group.
      - The inputs list within bucket_group_recipe has more than one element, which means it refers to more than one attribute defined in the inputs array one level higher.
      - The fields element is a list of lists. Every inner list of fields is associated with the respective attributes list.
      - The min_tokens and max_tokens lists have more than one element, with each element corresponding to the respective attributes list.
  Note: In some bucketing recipe definitions, there is a property that is named search_only. By default, its value is false. If set to true, this property indicates that a bucket or bucket group is used only for probabilistic search scenarios and is not used for entity resolution (matching) scenarios.
- compare_methods: Definitions of the comparison methods that are configured for an entity type. Each compare_methods JSON object consists of definitions of various compare methods. The matching algorithm adds up the scores from each compare method definition to get the final comparison score. Each compare method's JSON object contains three elements:
  - label: A label that identifies the compare method.
  - methods: A list of comparators that form a comparison group. Every element in this array represents one comparator, meant for one type of matching attribute. The matching algorithm considers the maximum of the scores from all the comparators in a methods list as the final score from this comparison group. Each comparator definition includes two elements:
    - inputs: For comparators, the inputs list has only one element, which is a JSON object. That JSON object has two elements, fields and attributes:
      - fields: The list of fields to use for comparison.
      - attributes: The list of attributes to use for comparison.
    - compare_recipe: This list is used mainly for defining the comparison steps. Typically, there is only one JSON element in this array, representing only one step for doing the comparison. This step has five elements:
      - label: A label that identifies the comparison step.
      - method: The internal method used. This element is just for reference and must not be edited.
      - inputs: A single element of the inputs list defined one level higher.
      - fields: The fields to be used for this comparison out of all of the fields that are defined in the inputs list one level higher.
      - comparison_resource: The name of a customizable comparison resource used for this comparison step.
  - weights: Each comparison that is done by a comparator results in a number score from 0 to 10. This number is called the distance or dis-similarity measure. A distance of 0 indicates that the values being compared are exactly the same. A distance of 10 indicates that they are completely different. Corresponding to the 11 distinct values (0 - 10), 11 weights are defined for each comparator. After calculating the distance, the compare method determines the corresponding weight value from the weights list, resulting in the total comparison score. Data engineers can customize the weights as needed, based on the data quality, distribution, or other factors.
- record_filter: The record filtering element enables the matching engine to select records for matching based on their entity types. Each record filter definition contains one element:
  - criteria: Includes or excludes records from matching consideration based on specific conditions. This element contains one JSON object with a key-value pair.
    The key of the criteria JSON object is an attribute name. It can be either of the following:
    - The record_source system attribute.
    - A user-defined custom attribute of a simple attribute type (string).
    The value of the criteria JSON object is another JSON object containing one element, which can be either of the following:
    - allowed: An array of string values. Records that include any of these values will be considered during matching.
    - disallowed: An array of string values. Records that include any of these values will not be considered during matching.
- source_level_thresholds: Source-level thresholds enable you to define autolink and clerical review thresholds on a source-to-source basis. Source-level thresholds override the default global threshold values. Each source-level threshold configuration contains a collection of sources with optional source-specific default thresholds or a collection of source-to-source threshold pairs that enable you to define different thresholds for each source. For more information, see Configuring source-specific matching thresholds in the Advanced matching algorithm tuning topic.
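The two global thresholds described for an entity type split comparison scores into three ranges. The sketch below is illustrative: the function name and the example threshold values are assumptions, and treating the range between the two thresholds as clerical review is inferred from the element names. How a score exactly equal to a threshold is handled is not stated here, so this sketch places it in the middle range.

```python
def classify_pair(score: float, clerical_review_threshold: float, auto_link_threshold: float) -> str:
    """Map a comparison score to an outcome using the two entity type thresholds."""
    if score < clerical_review_threshold:
        return "non-match"
    if score > auto_link_threshold:
        return "auto-link"
    return "clerical review"  # assumption: scores between the thresholds need review


# Hypothetical thresholds for illustration only.
print(classify_pair(4.2, clerical_review_threshold=9.0, auto_link_threshold=15.0))   # non-match
print(classify_pair(11.0, clerical_review_threshold=9.0, auto_link_threshold=15.0))  # clerical review
print(classify_pair(18.5, clerical_review_threshold=9.0, auto_link_threshold=15.0))  # auto-link
```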
Bucketing resources
The bucketing definitions use the following Map resources by default:
- person_map_name_nickname: Generates nicknames or alternate names for a given person name input.
- org_map_name_cnick_name: Generates nicknames or alternate names for a given organization name input.
The bucketing definitions use the following Set resources by default:
- person_set_name_bkt_anon: Removes anonymous person name values.
- org_set_name_acname: Removes anonymous organization name values.
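The bucket recipe elements min_tokens, max_tokens, order, and count can be pictured with the following sketch, which combines standardized tokens into bucket hashes. It is an assumption-laden illustration: the function name, the use of token combinations, and the truncated SHA-256 digest are choices made for this example, not the hashing scheme that IBM Match 360 uses internally.

```python
import hashlib
from itertools import combinations
from typing import List, Optional


def bucket_hashes(tokens: List[str], min_tokens: int = 1, max_tokens: int = 2,
                  order: bool = True, count: Optional[int] = None) -> List[str]:
    """Build bucket hashes from combinations of min_tokens..max_tokens tokens."""
    hashes: List[str] = []
    for size in range(min_tokens, max_tokens + 1):
        for combo in combinations(tokens, size):
            # order=True sorts tokens so that token order in the record does not matter
            parts = sorted(combo) if order else list(combo)
            digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:12]
            if digest not in hashes:
                hashes.append(digest)
    # count caps how many hashes a single record contributes
    return hashes if count is None else hashes[:count]


# With order=True, "SMITH JOHN" and "JOHN SMITH" land in the same bucket.
print(bucket_hashes(["SMITH", "JOHN"], 2, 2) == bucket_hashes(["JOHN", "SMITH"], 2, 2))  # True
```

Records that share at least one bucket hash become candidates for comparison, which is why a sorted token order helps records with transposed name tokens find each other.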
Comparison functions
Comparison functions, sometimes called comparators, are one of the key components of the matching algorithm. Comparison functions are used by the matching engine to compare record data during the matching process. Essentially, record matching involves comparing different types of attributes between different records’ data.
For many of the commonly used attribute types in the person, organization, and location domains, the IBM Match 360 matching engine includes preconfigured comparison methods.
In IBM Match 360, comparison functions use an approach to comparison known as feature vectors. There are different customizable feature definitions in IBM Match 360 that are used for different comparison functions. Each comparison results in a measure of distance (a vector) that shows how dissimilar two given attribute values are.
In the matching algorithm, each discrete distance value is given a weight that determines how strongly to consider that value. The weight combines with the distance to produce a comparison score. The matching algorithm adds all of the comparison scores together to arrive at a final comparison score for the overall record-to-record comparison.
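The scoring described here, together with the compare_methods description earlier in this topic, can be sketched as follows: each comparator's distance (0 to 10) selects one of 11 weights, the best comparator score in a comparison group is kept, and the group scores are added together. The weight values and the data layout are hypothetical; they only illustrate the lookup, maximum, and sum.

```python
from typing import List, Sequence, Tuple

# One comparator result: (distance from 0 to 10, the 11 weights for that comparator).
Comparator = Tuple[int, Sequence[float]]


def comparator_score(distance: int, weights: Sequence[float]) -> float:
    """Look up the weight that corresponds to a distance value."""
    if len(weights) != 11:
        raise ValueError("expected 11 weights, one per distance value 0-10")
    if not 0 <= distance <= 10:
        raise ValueError("distance must be between 0 and 10")
    return weights[distance]


def final_comparison_score(groups: List[List[Comparator]]) -> float:
    """Take the maximum score within each comparison group, then sum the groups."""
    return sum(max(comparator_score(d, w) for d, w in group) for group in groups)


# Hypothetical weights: positive for similar values, negative for dissimilar ones.
NAME_WEIGHTS = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0, -4.0, -5.0]
DATE_WEIGHTS = [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, -1.0, -1.0, -2.0, -2.0, -3.0]

groups = [
    [(1, NAME_WEIGHTS), (4, NAME_WEIGHTS)],  # group score: max(4.0, 1.0) = 4.0
    [(0, DATE_WEIGHTS)],                     # group score: 3.0
]
print(final_comparison_score(groups))  # 7.0
```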
About features
A feature represents the fine-level details of a comparison function. Different types of attributes use different types of similarity checks, meaning that their features vary as well.
Feature definitions dictate the types of internal functions used for each comparison function. Examples of internal functions include exact match, edit distance, nickname, phonetic equivalent, or initial match.
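As an example of one such internal function, the sketch below computes a classic Levenshtein edit distance. It is a generic textbook implementation shown for orientation; the source does not say which edit distance variant the product uses or how the result is mapped onto a feature.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("JON", "JOHN"))      # 1
print(edit_distance("SMITH", "SMYTHE"))  # 2
```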
Comparison resources
Each comparison method includes resources that contain the details of its internal comparison operations.
Each of the default comparison types has its own resources. See each comparison type for details of the associated resources.
For comparisons on custom attribute types that have a matching type of generic, the generic comparison method includes the following resources:
- compare_spec_generic: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_generic.
Person name comparisons
Different fields within a person name attribute are handled differently. For fields like prefix, suffix, and generation values, exactness or non-matching is checked. Other fields such as given name, last name, and middle name primarily use the following features:
- Exact match
- Nickname match
- Edit distance
- Initials match
- Phonetic matching
- Misplacement of tokens
- Extra tokens
- Missing values
The person name comparison method includes the following resources:
- person_compare_spec_name: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_name. For example: person_person_entity_compare_spec_name.
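The naming pattern for generated resources can be expressed as a small helper. The function name and its parameters are hypothetical. Only the recordType_entityType_compare_spec_name pattern and the person_person_entity example come from the text above.

```python
def generated_resource_name(record_type: str, entity_type: str, comparison: str) -> str:
    # Follows the documented pattern recordType_entityType_compare_spec_<comparison>.
    return f"{record_type}_{entity_type}_compare_spec_{comparison}"


print(generated_resource_name("person", "person_entity", "name"))
# person_person_entity_compare_spec_name
```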
Organization name comparisons
For organization names, there is typically one field that contains the entire business name. That field is compared primarily using the following features:
- Exact match
- Nickname match
- Edit distance
- Initials match
- Phonetic matching
- Misplacement of tokens
- Extra tokens
- Missing values
For organization names, the acronyms and nicknames are also compared for exactness.
The organization name comparison method includes the following resources:
- org_compare_spec_name: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_name.
Date comparisons
For dates, there are typically three fields to compare: day, month, and year.
The year field is compared using the following features:
- Exactness
- Edit distance
- Non-matching
- Missing
The day and month fields are compared using the following features:
- Exactness
- Non-matching
- Missing
The date comparator also checks whether the day and month fields have been transposed due to locale differences in date formatting.
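A minimal sketch of the day and month checks, including the transposition test, might look like the following. The DateParts type and both function names are hypothetical, and the edit distance check on the year field is omitted for brevity.

```python
from typing import NamedTuple, Optional


class DateParts(NamedTuple):
    day: Optional[int]
    month: Optional[int]
    year: Optional[int]


def field_feature(a: Optional[int], b: Optional[int]) -> str:
    # Exactness, non-matching, or missing for a single date field.
    if a is None or b is None:
        return "missing"
    return "exact" if a == b else "non-matching"


def day_month_transposed(a: DateParts, b: DateParts) -> bool:
    # True when one date's day equals the other's month and vice versa,
    # as can happen when DD/MM and MM/DD formats are mixed.
    if None in (a.day, a.month, b.day, b.month):
        return False
    return a.day == b.month and a.month == b.day and a.day != a.month


first = DateParts(day=3, month=4, year=1990)
second = DateParts(day=4, month=3, year=1990)
print(field_feature(first.day, second.day))  # non-matching
print(day_month_transposed(first, second))   # True
```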
The date comparison method includes the following resources:
- compare_spec_date: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_date.
Gender comparisons
The gender attribute is compared using the following features:
- Exactness
- Non-matching
The gender comparison method includes the following resources:
- compare_spec_gender: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_gender.
Address comparisons
Different fields within an address attribute are handled differently.
Fields like country, city, province/state, and subdivision are compared using the following features:
- Exactness
- Equivalency
- Edit distance
- Non-matching
- Missing
Postal code fields are compared using the following features:
- Exactness
- Edit distance
- Non-matching
- Missing
Fields like street number, street name, street type, unit number, and direction are compared using the following features:
- Exactness
- Equivalency
- Initials match
- Edit distance
- Non-matching
- Misplacement of tokens
- Missing
The address comparison method includes the following resources:
- compare_spec_address: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_address.
Phone comparisons
Phone number attributes are compared using the following features:
- Exact match
- Edit distance
- Non-matching
The phone comparison method includes the following resources:
- compare_spec_phone: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_phone.
Identifier comparisons
Identification number attributes are compared using the following features:
- Exact match
- Edit distance
- Non-matching
The identifier comparison method includes the following resources:
- compare_spec_identifier: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_identifier.
Email comparisons
Email attributes consist of two parts: the unique ID (before the @ symbol) and the email domain (after the @ symbol). Both the ID and domain parts are compared, separately, using the following features:
- Exact match
- Edit distance
- Non-matching
The outcomes of the two comparisons are combined in a weighted manner to produce an overall comparison score.
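The split-and-combine approach for email values can be sketched as below. This is an illustration under stated assumptions: the 0.7 and 0.3 weights are hypothetical, the function names are invented for the example, and Python's difflib.SequenceMatcher ratio stands in for the product's edit distance feature, which it does not replicate. The sketch also assumes each value contains an @ symbol.

```python
from difflib import SequenceMatcher


def split_email(address: str) -> tuple[str, str]:
    # Split at the last @ into the unique ID and the domain.
    local, _, domain = address.rpartition("@")
    return local, domain


def part_similarity(a: str, b: str) -> float:
    # 1.0 for an exact match, otherwise a similarity ratio between 0.0 and 1.0.
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def email_score(a: str, b: str, id_weight: float = 0.7, domain_weight: float = 0.3) -> float:
    # Compare the ID and domain parts separately, then combine with assumed weights.
    id_a, domain_a = split_email(a)
    id_b, domain_b = split_email(b)
    return (id_weight * part_similarity(id_a, id_b)
            + domain_weight * part_similarity(domain_a, domain_b))


same_id = email_score("jsmith@example.com", "jsmith@example.org")
same_domain = email_score("jsmith@example.com", "bjones@example.com")
print(same_id > same_domain)  # True: with these weights the ID part dominates
```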
The email comparison method includes the following resources:
- compare_spec_email: In the generated algorithm, the name format of this resource is recordType_entityType_compare_spec_email.
Edit distance
The IBM Match 360 matching engine calculates edit distance as one of the internal functions during comparison and matching of various attributes. Edit distance is a measurement of how dissimilar two strings are from each other. It is calculated by counting the number of changes required to transform one string into the other.
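As an illustration of counting the changes needed to transform one string into another, the following sketch implements the classic Levenshtein distance (insertions, deletions, and substitutions, each costing 1). The source does not name the exact function that IBM Match 360 uses, so treat the choice of Levenshtein as an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```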
There are different ways to define edit distance by using different sets of string operations. By default, IBM Match 360 uses a standard edit distance function that is publicly available in literature. As an alternative, you can choose to use a specialized IBM Match 360 edit distance function.
- The standard edit distance function provides better performance of the matching engine. For this reason, it is the default comparison configuration for all attributes except for the Telephone attribute type.
- The specialized edit distance function is built for hyper-precision use cases. This option takes into consideration typos or similar-looking characters, such as 8 and B, 0 and O, 5 and S, or 1 and I. When there is a mismatch in two compared values based on similar-looking characters, the assigned dissimilarity measure is less than what would be assigned by a standard edit distance function. As a result, these types of mismatches are not penalized as strongly by the specialized function.
Important: The specialized edit distance function includes some complex calculations. As a result, choosing this option has an impact on system performance during the matching process.
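The idea behind the specialized function can be sketched as an edit distance in which substituting one look-alike character for another costs less than an ordinary substitution. This is not IBM's implementation. The 0.5 cost is a hypothetical value, and the table holds only the four character pairs named above.

```python
# Look-alike pairs taken from the examples in the text.
LOOKALIKES = {frozenset("8B"), frozenset("0O"), frozenset("5S"), frozenset("1I")}


def substitution_cost(x: str, y: str, lookalike_cost: float = 0.5) -> float:
    # lookalike_cost is an assumed value, chosen only for illustration.
    if x == y:
        return 0.0
    if frozenset((x.upper(), y.upper())) in LOOKALIKES:
        return lookalike_cost
    return 1.0


def lookalike_edit_distance(a: str, b: str) -> float:
    # Same dynamic program as a standard edit distance, with weighted substitutions.
    previous = [float(j) for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [float(i)]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1.0,                                  # deletion
                current[j - 1] + 1.0,                               # insertion
                previous[j - 1] + substitution_cost(char_a, char_b),  # substitution
            ))
        previous = current
    return previous[-1]


print(lookalike_edit_distance("B0B", "808"))  # 1.0, where a standard function gives 2
print(lookalike_edit_distance("cat", "cut"))  # 1.0, an ordinary substitution
```

The extra lookup on every substitution hints at why a function of this kind costs more to run than the standard one.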
For information about customizing your matching algorithm, including using the API to customize the edit distance, see Customizing and strengthening your matching algorithm.
Learn more
- Data concepts
- Matching your data to create master data entities
- Customizing and strengthening your matching algorithm