Entity extraction
The Watson Natural Language Processing Entity extraction blocks extract entities from input text.
Block name
The Watson Natural Language Processing library offers the following entity extraction blocks:
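- For machine-learning-based extraction: entity-mentions_bert_multi_stock and entity-mentions_bilstm_en_pii
- For rule-based extraction: entity-mentions_rbr_xx_stock and entity-mentions_rbr_multi_pii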
Machine-learning-based extraction for general entities
The machine-learning-based extraction model entity-mentions_bert_multi_stock is trained on labeled data for the more complex entity types such as person, organization, and location.
Capabilities
The entity block extracts entities from the input text. The following types of entities are recognized:
- Date
- Duration
- Facility
- Geographic feature
- Job title
- Location
- Measure
- Money
- Ordinal
- Organization
- Person
- Time
Supported languages
Entity extraction is available for the following languages. For a list of the language codes and the corresponding language, see Language codes.
ar, cs, da, de, en, es, fi, fr, he, hi, it, ja, ko, nb, nl, nn, pt, ro, ru, sk, sv, tr, zh-cn
Dependencies on other blocks
The following block must run before you can run the Entity extraction block:
syntax_izumo_<language>_stock
Code sample
import watson_nlp
# Load Syntax Model for English, and the multilingual BERT Entity model
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
bert_entity_model = watson_nlp.load(watson_nlp.download('entity-mentions_bert_multi_stock'))
# Run the syntax model on the input text
syntax_prediction = syntax_model.run('IBM\'s CEO Arvind Krishna is based in the US')
# Run the entity mention model on the result of syntax model
bert_entity_mentions = bert_entity_model.run(syntax_prediction)
print(bert_entity_mentions)
Output of the code sample:
{ "mentions": [ { "span": { "begin": 0, "end": 3, "text": "IBM" }, "type": "Organization", "producer_id": { "name": "BERT Entity Mentions", "version": "0.0.1" }, "confidence": 0.9944692850112915, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" }, { "span": { "begin": 6, "end": 9, "text": "CEO" }, "type": "JobTitle", "producer_id": { "name": "BERT Entity Mentions", "version": "0.0.1" }, "confidence": 0.9871304631233215, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" }, { "span": { "begin": 10, "end": 24, "text": "Arvind Krishna" }, "type": "Person", "producer_id": { "name": "BERT Entity Mentions", "version": "0.0.1" }, "confidence": 0.9988446235656738, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" }, { "span": { "begin": 41, "end": 43, "text": "US" }, "type": "Location", "producer_id": { "name": "BERT Entity Mentions", "version": "0.0.1" }, "confidence": 0.9911670088768005, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" } ], "producer_id": { "name": "BERT Entity Mentions", "version": "0.0.1" } }
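The printed result can be post-processed like any other JSON document. The following is a minimal sketch, assuming the prediction has already been turned into a plain dictionary (here by parsing an abbreviated copy of the output above). The helper name group_mentions_by_type is hypothetical and is not part of the Watson NLP library.

```python
import json
from collections import defaultdict

# Abbreviated copy of the printed output above (only the fields used here).
output_json = """
{"mentions": [
  {"span": {"begin": 0, "end": 3, "text": "IBM"},
   "type": "Organization", "confidence": 0.9944692850112915},
  {"span": {"begin": 6, "end": 9, "text": "CEO"},
   "type": "JobTitle", "confidence": 0.9871304631233215},
  {"span": {"begin": 10, "end": 24, "text": "Arvind Krishna"},
   "type": "Person", "confidence": 0.9988446235656738},
  {"span": {"begin": 41, "end": 43, "text": "US"},
   "type": "Location", "confidence": 0.9911670088768005}
]}
"""


def group_mentions_by_type(result):
    """Hypothetical helper: map each entity type to the mention texts found."""
    grouped = defaultdict(list)
    for mention in result["mentions"]:
        grouped[mention["type"]].append(mention["span"]["text"])
    return dict(grouped)


entities_by_type = group_mentions_by_type(json.loads(output_json))
for entity_type, texts in entities_by_type.items():
    print(f"{entity_type}: {texts}")
```

Grouping by the type field is only one option; the confidence field could be used in the same way to drop low-scoring mentions.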
Machine-learning-based extraction for PII entities
The machine-learning-based extraction model entity-mentions_bilstm_en_pii is trained on labeled data for types where labeled data can be obtained, namely person and location.
Capabilities
The entity block entity-mentions_bilstm_en_pii recognizes the following types of entities:
Entity type name | Description | Supported languages |
---|---|---|
Location | All geo-political regions, continents, countries, and street names, states, provinces, cities, towns or islands. | en |
Person | Any being; living, nonliving, fictional or real. | en |
Dependencies on other blocks
The following block must run before you can run the entity-mentions_bilstm_en_pii block:
syntax_izumo_en_stock
Code sample
import watson_nlp
text = 'Denver is the capital of Colorado. The total estimated government spending in Colorado in fiscal year 2016 was $36.0 billion. IBM office is located in downtown Denver. Michael Hancock is the mayor of Denver.'
# Load the rule-based (RBR) entity model. Note that this sample uses entity-mentions_rbr_en_stock, which runs directly on the raw text.
rbr_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_en_stock'))
# Run the RBR model on the input text
rbr_result = rbr_model.run(text)
print(rbr_result)
Output of the code sample:
{ "mentions": [ { "span": { "begin": 102, "end": 106, "text": "2016" }, "type": "Number", "producer_id": { "name": "RBR mentions", "version": "0.0.1" }, "confidence": 0.8, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" }, { "span": { "begin": 112, "end": 124, "text": "36.0 billion" }, "type": "Number", "producer_id": { "name": "RBR mentions", "version": "0.0.1" }, "confidence": 0.8, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" } ], "producer_id": { "name": "RBR mentions", "version": "0.0.1" } }
Rule-based extraction for general entities
The rule-based model entity-mentions_rbr_xx_stock identifies syntactically regular entities.
Capabilities
Rule-based extraction handles syntactically regular entity types. The entity block extracts entities from the input text. The following types of entities are recognized:
- PhoneNumber
- EmailAddress
- Number
- Percent
- IPAddress
- HashTag
- TwitterHandle
- URL
- Date
Supported languages
Entity extraction is available for the following languages. For a list of the language codes and the corresponding language, see Language codes.
ar, cs, da, de, en, es, fi, fr, he, hi, it, ja, ko, nb, nl, nn, pt, ro, ru, sk, sv, tr, zh-cn, zh-tw
Dependencies on other blocks
None
Code sample
import watson_nlp
# Load a rule-based Entity Mention model for English
rbr_entity_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_en_stock'))
# Run the entity model on the input text
rbr_entity_mentions = rbr_entity_model.run('My email is john@us.ibm.com')
print(rbr_entity_mentions)
Output of the code sample:
{ "mentions": [ { "span": { "begin": 12, "end": 27, "text": "john@us.ibm.com" }, "type": "EmailAddress", "producer_id": { "name": "RBR mentions", "version": "0.0.1" }, "confidence": 0.8, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" } ], "producer_id": { "name": "RBR mentions", "version": "0.0.1" } }
Rule-based extraction for PII entities
The rule-based model entity-mentions_rbr_multi_pii handles the majority of the types by identifying common formats of PII entities and performing checksum or other validation where appropriate for each entity type. For example, credit card number candidates are validated using the Luhn algorithm.
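To make the Luhn validation step concrete, here is a minimal sketch of the standard Luhn check in Python. It illustrates the general algorithm only and is not the implementation used inside the rule-based model; the function name luhn_is_valid is hypothetical.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                # Same as adding the two digits of the doubled value.
                digit -= 9
        total += digit
    return total % 10 == 0


# The AMEX-style number used in the code sample of this section.
print(luhn_is_valid("378282246310005"))
# The same number with the last digit changed.
print(luhn_is_valid("378282246310006"))
```

A check like this lets a rule-based extractor reject digit sequences that merely look like card numbers.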
Capabilities
The entity block entity-mentions_rbr_multi_pii recognizes the following types of entities:
Entity type name | Description | Supported languages |
---|---|---|
BankAccountNumber.CreditCardNumber.Amex | Credit card number for card type AMEX (15 digits). Checked through the Luhn algorithm. | All |
BankAccountNumber.CreditCardNumber.Master | Credit card number for card type Mastercard (16 digits). Checked through the Luhn algorithm. | All |
BankAccountNumber.CreditCardNumber.Other | Credit card number for the remaining category of other card types. Checked through the Luhn algorithm. | All |
BankAccountNumber.CreditCardNumber.Visa | Credit card number for card type VISA (16 to 19 digits). Checked through the Luhn algorithm. | All |
EmailAddress | Email addresses, for example, john@gmail.com | ar, cs, da, de, en, es, fi, fr, he, hi, it, ja, ko, nb, nl, nn, pl, pt, ro, ru, sk, sv, tr, zh-cn |
IPAddress | IPv4 and IPv6 addresses | ar, cs, da, de, en, es, fi, fr, he, hi, it, ja, ko, nb, nl, nn, pl, pt, ro, ru, sk, sv, tr, zh-cn |
PhoneNumber | Any specific phone number, for example, 0511-123-456 | ar, cs, da, de, en, es, fi, fr, he, hi, it, ja, ko, nb, nl, nn, pl, pt, ro, ru, sk, sv, tr, zh-cn |
Some PII entity type names are country-specific. The _ in the following entity types is a placeholder for a country code.
- BankAccountNumber.BBAN._: These are more variable national bank account numbers, and the extraction is mostly language-specific without a general checksum algorithm.
- BankAccountNumber.IBAN._: Highly standardized IBANs are supported in a language-independent way and with a checksum algorithm.
- NationalNumber.NationalID._: These national IDs don't have a (published) checksum algorithm and are extracted on a language-specific basis.
- NationalNumber.Passport._: Checksums are implemented only for the countries where a checksum algorithm exists. These are extracted on a language-specific basis with additional context restrictions.
- NationalNumber.TaxID._: These IDs don't have a (published) checksum algorithm and are extracted on a language-specific basis.
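The text above does not say which checksum is applied to IBANs. The sketch below assumes the standard IBAN check (ISO 7064 mod 97-10): move the first four characters to the end, map letters to numbers (A=10 through Z=35), and verify that the result modulo 97 equals 1. The function name iban_checksum_is_valid and the sample IBAN are illustrative assumptions and do not come from the Watson NLP library.

```python
def iban_checksum_is_valid(iban: str) -> bool:
    """Check an IBAN with the standard mod-97 rule (assumed, see lead-in)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


# A commonly cited illustrative IBAN (not taken from this documentation).
print(iban_checksum_is_valid("GB82 WEST 1234 5698 7654 32"))
# The same IBAN with altered check digits.
print(iban_checksum_is_valid("GB83 WEST 1234 5698 7654 32"))
```

Because the rule does not depend on the language of the surrounding text, a check of this kind fits the language-independent IBAN support described above.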
The following table lists which entity types are available for which languages and which country code to use.
Country | Entity Type Name | Description | Supported Languages |
---|---|---|---|
Austria | | Basic bank account number | de |
Austria | | International bank account number | all |
Austria | | Passport number | de |
Austria | | Tax identification number | de |
Belgium | | Basic bank account number | fr, nl |
Belgium | | International bank account number | all |
Belgium | | National identification number | fr, nl |
Belgium | | Passport number | fr, nl |
Bulgaria | | Basic bank account number | bg |
Bulgaria | | International bank account number | all |
Bulgaria | | National identification number | bg |
Canada | | Social insurance number. Checksum algorithm is implemented. | en, fr |
Croatia | | Basic bank account number | hr |
Croatia | | International bank account number | all |
Croatia | | National identification number | hr |
Croatia | | Tax identification number | hr |
Cyprus | | Basic bank account number | el |
Cyprus | | International bank account number | all |
Cyprus | | Tax identification number | el |
Czechia | | Basic bank account number | cs |
Czechia | | International bank account number | cs |
Czechia | | National identification number | cs |
Czechia | | Tax identification number | cs |
Denmark | | Basic bank account number | da |
Denmark | | International bank account number | all |
Denmark | | National identification number | da |
Estonia | | Basic bank account number | et |
Estonia | | International bank account number | all |
Estonia | | National identification number | et |
Finland | | Basic bank account number | fi |
Finland | | International bank account number | all |
Finland | | National identification number | fi |
Finland | | Passport number | fi |
France | | Basic bank account number | fr |
France | | International bank account number | all |
France | | Passport number | fr |
France | | Social insurance number. Checksum algorithm is implemented. | fr |
Germany | | Basic bank account number | de |
Germany | | International bank account number | all |
Germany | | Passport number | de |
Germany | | Social insurance number. Checksum algorithm is implemented. | de |
Greece | | Basic bank account number | el |
Greece | | International bank account number | all |
Greece | | Passport number | el |
Greece | | Tax identification number | el |
Greece | | National ID number | el |
Hungary | | Basic bank account number | hu |
Hungary | | International bank account number | all |
Hungary | | National identification number | hu |
Hungary | | Tax identification number | hu |
Iceland | | Basic bank account number | is |
Iceland | | International bank account number | all |
Iceland | | National identification number | is |
Ireland | | Basic bank account number | en |
Ireland | | International bank account number | all |
Ireland | | National identification number | en |
Ireland | | Passport number | en |
Ireland | | Tax identification number | en |
Italy | | Basic bank account number | it |
Italy | | International bank account number | all |
Italy | | National identification number | it |
Italy | | Passport number | it |
Latvia | | Basic bank account number | lv |
Latvia | | International bank account number | all |
Latvia | | National identification number | lv |
Liechtenstein | | Basic bank account number | de |
Liechtenstein | | International bank account number | all |
Lithuania | | Basic bank account number | lt |
Lithuania | | International bank account number | all |
Lithuania | | National identification number | lt |
Luxembourg | | Basic bank account number | de, fr |
Luxembourg | | International bank account number | all |
Luxembourg | | Tax identification number | de, fr |
Malta | | Basic bank account number | mt |
Malta | | International bank account number | all |
Netherlands | | Basic bank account number | nl |
Netherlands | | International bank account number | all |
Netherlands | | National identification number | nl |
Netherlands | | Passport number | nl |
Norway | | Basic bank account number | no |
Norway | | International bank account number | all |
Norway | | National identification number | no |
Norway | | National identification number, old format | no |
Norway | | Passport number | no |
Poland | | Basic bank account number | pl |
Poland | | International bank account number | all |
Poland | | National identification number | pl |
Poland | | Passport number | pl |
Poland | | Tax identification number | pl |
Portugal | | International bank account number | all |
Portugal | | Basic bank account number | pt |
Portugal | | National identification number | pt |
Portugal | | National identification number, obsolete format | pt |
Portugal | | Tax identification number | pt |
Romania | | Basic bank account number | ro |
Romania | | International bank account number | all |
Romania | | National identification number | ro |
Romania | | Tax identification number | ro |
Slovakia | | International bank account number | all |
Slovakia | | Basic bank account number | sk |
Slovakia | | Tax identification number | sk |
Slovakia | | National identification number | sk |
Slovenia | | International bank account number | all |
Spain | | International bank account number | all |
Spain | | Basic bank account number | es |
Spain | | National identification number | es |
Spain | | Passport number | es |
Spain | | Tax identification number | es |
Sweden | | International bank account number | all |
Sweden | | Basic bank account number | sv |
Sweden | | National identification number | sv |
Sweden | | Passport number | sv |
Switzerland | | International bank account number | all |
Switzerland | | Basic bank account number | de, fr, it |
Switzerland | | National identification number | de, fr, it |
Switzerland | | Passport number | de, fr, it |
Switzerland | | National identification number, obsolete format | de, fr, it |
United Kingdom of Great Britain and Northern Ireland | | International bank account number | all |
United Kingdom of Great Britain and Northern Ireland | | National Health Service number | all |
United Kingdom of Great Britain and Northern Ireland | | National Social Security Insurance number | all |
United Kingdom of Great Britain and Northern Ireland | | National ID number, obsolete format | all |
United Kingdom of Great Britain and Northern Ireland | | Passport number. Checksum algorithm is not implemented, so extraction comes with additional context restrictions. | all |
United States | | Social Security number. Checksum algorithm is not implemented, so extraction comes with additional context restrictions. | en |
United States | | Passport number. Checksum algorithm is not implemented, so extraction comes with additional context restrictions. | en |
Dependencies on other blocks
None
Code sample
import watson_nlp
# Load the RBR PII model. Note that this is a multilingual model supporting multiple languages.
rbr_entity_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_multi_pii'))
# Run the RBR model. Note that language code of the input text is passed as a parameter to the run method.
rbr_entity_mentions = rbr_entity_model.run('Please find my credit card number here: 378282246310005. Thanks for the payment.', language_code='en')
print(rbr_entity_mentions)
Output of the code sample:
{ "mentions": [ { "span": { "begin": 40, "end": 55, "text": "378282246310005" }, "type": "BankAccountNumber.CreditCardNumber.Amex", "producer_id": { "name": "RBR mentions", "version": "0.0.1" }, "confidence": 0.8, "mention_type": "MENTT_UNSET", "mention_class": "MENTC_UNSET", "role": "" } ], "producer_id": { "name": "RBR mentions", "version": "0.0.1" } }
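A common follow-up to PII extraction is masking the detected values. The sketch below assumes that the begin and end values in each span are character offsets into the input text with an exclusive end, which is consistent with the sample output above. It works on a plain dictionary holding only the fields it needs; the helper name redact_mentions is hypothetical and is not part of the Watson NLP library.

```python
def redact_mentions(text, result, template="[{type}]"):
    """Hypothetical helper: replace each mention span with a type placeholder."""
    redacted = text
    # Replace from the end of the text so earlier offsets stay valid.
    mentions = sorted(
        result["mentions"], key=lambda m: m["span"]["begin"], reverse=True
    )
    for mention in mentions:
        begin = mention["span"]["begin"]
        end = mention["span"]["end"]
        placeholder = template.format(type=mention["type"])
        redacted = redacted[:begin] + placeholder + redacted[end:]
    return redacted


text = ("Please find my credit card number here: 378282246310005. "
        "Thanks for the payment.")
# Abbreviated copy of the output above (only the fields used here).
result = {
    "mentions": [
        {
            "span": {"begin": 40, "end": 55, "text": "378282246310005"},
            "type": "BankAccountNumber.CreditCardNumber.Amex",
        }
    ]
}
redacted_text = redact_mentions(text, result)
print(redacted_text)
```

Processing the mentions from right to left keeps the remaining offsets correct even when a placeholder is longer or shorter than the text it replaces.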