CSV file for term assignment based on rules
Create a CSV file with the name ikc-term-assignment-rules.csv that defines the rules for term assignment and upload it to the project. The CSV file must conform to formatting rules.
General formatting rules
The CSV file must comply with the Common Format and MIME Type for Comma-Separated Values (CSV) Files specification (RFC 4180) and must be encoded in UTF-8.
Limitations
The maximum recommended size of the CSV import file is 50 MB.
Header row
The header row of the CSV file represents the properties that make up the rule and the action to take.
Follow these guidelines for the header row:
- The header row must be the first row in the file and must not be repeated.
- Separate column names with a comma. If you create the file in a spreadsheet editor, the commas are added automatically when you save the file in CSV format.
- The header row must include the mandatory columns for the rule.
- You can omit any optional columns.
- You can add arbitrary other columns, which will be ignored.
- Use the exact column names in the header row. Column names are case-sensitive.
- Make sure the column names do not include extra white space characters. White space characters might be added by a spreadsheet or text editor, but not be visible. If you receive an import error that the column names are incorrect, even though your columns are spelled and capitalized correctly, check for white spaces.
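The header guidelines above can be checked before upload. The following Python sketch is only an illustration: the function name check_header is hypothetical, the column names are taken from the Rule columns section of this topic, and the import service itself might report problems differently.

```python
import csv
import io

# Column names as listed in the "Rule columns" section of this topic.
MANDATORY = {"OBJECT_TYPE", "PROPERTY", "MATCH_STRING", "MATCH_TYPE"}
TERM_COLUMNS = {"TERM_NAME", "TERM_ID"}


def check_header(csv_text: str) -> list[str]:
    """Return a list of problems found in the header row (empty if none)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    # Invisible leading or trailing white space makes a name differ from the expected one.
    for name in header:
        if name != name.strip():
            problems.append(f"column name {name!r} has surrounding white space")
    # The comparison is exact, so it is also case-sensitive.
    present = set(header)
    for name in sorted(MANDATORY - present):
        problems.append(f"mandatory column {name} is missing")
    if not TERM_COLUMNS & present:
        problems.append("either TERM_NAME or TERM_ID must be present")
    return problems
```

Additional columns, such as a COMMENT column, pass this check unchanged because only the mandatory names are looked up.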
Column specification
To delimit values for different columns, use a comma. If you create the file in a spreadsheet editor, the commas are added automatically when you save the file in CSV format.
To omit a value for a column, use a comma directly after the previous comma and without any other characters. For example, two consecutive commas indicate that the second column is empty.
To enclose fields, use double quotation marks (").
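If you generate the file with a script instead of a spreadsheet editor, a CSV library handles delimiters, empty values, and quotation marks for you. The following sketch uses the Python standard csv module. The match string city, state is a made-up value that is chosen only because it contains a comma.

```python
import csv
import io

buffer = io.StringIO()
# The default dialect quotes a field only when it needs quoting (QUOTE_MINIMAL)
# and ends each row with CRLF.
writer = csv.writer(buffer)
writer.writerow(["OBJECT_TYPE", "PROPERTY", "MATCH_TYPE", "MATCH_STRING", "TERM_NAME", "CONFIDENCE"])
# The match string contains a comma, so the writer encloses it in double quotation marks.
# The empty CONFIDENCE value becomes an empty field after the last comma.
writer.writerow(["column", "name", "contains", "city, state", "GDPR >> personal data", ""])
csv_text = buffer.getvalue()
print(csv_text)
```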
Term category paths
You must specify the full category path for a term. To delimit the category path, use two greater-than (>>) symbols between each level of the category hierarchy and between the category path and the artifact name. If you start the path with >>, the root category is [uncategorized].
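To illustrate how such a path breaks down, the following Python sketch splits a term name into its category levels and the term itself. The function name split_term_path is hypothetical, and trimming the white space around >> is an assumption that is based on the examples in this topic.

```python
def split_term_path(term_name: str) -> tuple[list[str], str]:
    """Split 'Category 1 >> Category2 >> MyTerm' into (categories, term)."""
    parts = [part.strip() for part in term_name.split(">>")]
    *categories, term = parts
    # A leading '>>' produces an empty first element, which this topic
    # describes as the [uncategorized] root category.
    if categories and categories[0] == "":
        categories[0] = "[uncategorized]"
    return categories, term
```

For example, split_term_path("Category 1 >> Category2 >> MyTerm") returns the two category levels and MyTerm as the term.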
Rule columns
The CSV file can contain mandatory and optional columns.
To define the rule condition, include these columns:
OBJECT_TYPE
The type of object where terms should be assigned. This column is mandatory and must not be empty. Valid values:
- asset
- column
PROPERTY
The property to match. This column is mandatory and must not be empty. Valid values:
- name
- description
- mostfreqvalues: Any of the most frequent values of the data profile. Rules with this property require data profiling before the rule can be properly applied. OBJECT_TYPE must be column.
- dataclassname: The name of the data class that is assigned to a column. OBJECT_TYPE must be column.
- assetid: The ID of the data asset.
MATCH_STRING
The string to match against the property. You can set any value. This column is mandatory and must not be empty.
MATCH_TYPE
Describes how the match string should be matched against the property. This column is mandatory and must not be empty. Valid values:
- equals: Case-insensitive exact match.
- equalscs: Case-sensitive exact match.
- contains: Match if the property contains the match string. Matching is case-insensitive.
- containscs: Match if the property contains the match string. Matching is case-sensitive.
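The four match types can be pictured with the following Python sketch. It is not the product's implementation: the function name matches is hypothetical, and using str.casefold() for the case-insensitive comparisons is an assumption.

```python
def matches(property_value: str, match_string: str, match_type: str) -> bool:
    """Apply one of the four MATCH_TYPE values to a property value."""
    if match_type == "equals":        # case-insensitive exact match
        return property_value.casefold() == match_string.casefold()
    if match_type == "equalscs":      # case-sensitive exact match
        return property_value == match_string
    if match_type == "contains":      # case-insensitive substring match
        return match_string.casefold() in property_value.casefold()
    if match_type == "containscs":    # case-sensitive substring match
        return match_string in property_value
    raise ValueError(f"unknown MATCH_TYPE: {match_type!r}")
```

With these definitions, a column named Customer_Address satisfies a contains rule for address but not a containscs rule for the same string.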
To define which terms to assign with which confidence, include these columns:
TERM_NAME
The name of the term including the category path as described in Term category paths. For example, Category 1 >> Category2 >> MyTerm.
Either TERM_NAME or TERM_ID must be present. You can specify both. In that case, TERM_ID takes precedence. If you plan to use the rules file in different systems with similar terms and category hierarchies, use term names instead of term IDs.
TERM_ID
The ID of the term. You can use the artifact ID or the global ID.
Either TERM_NAME or TERM_ID must be present. You can specify both. In that case, TERM_ID takes precedence. If you plan to use the rules file in different systems with similar terms and category hierarchies, use term names instead of term IDs.
CONFIDENCE
A float value between 0 and 1 that indicates the confidence to assign. The default value is 1.0 (100%). Independent of the locale, the decimal separator is a period (.).
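The precedence of TERM_ID over TERM_NAME and the handling of an empty CONFIDENCE value can be sketched in Python as follows. The function name resolve_action and the returned tuple are illustrative assumptions, and the sketch applies only to rows that carry an action.

```python
def resolve_action(row: dict[str, str], default_confidence: float = 1.0) -> tuple[str, str, float]:
    """Return (reference kind, term reference, confidence) for one rule row."""
    term_id = (row.get("TERM_ID") or "").strip()
    term_name = (row.get("TERM_NAME") or "").strip()
    if term_id:
        kind, reference = "id", term_id        # TERM_ID takes precedence when both are set
    elif term_name:
        kind, reference = "name", term_name
    else:
        raise ValueError("either TERM_NAME or TERM_ID must be present")
    raw = (row.get("CONFIDENCE") or "").strip()
    # float() expects a period as the decimal separator, regardless of locale,
    # so a value such as "0,9" raises ValueError.
    confidence = float(raw) if raw else default_confidence
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"CONFIDENCE must be between 0 and 1, got {raw!r}")
    return kind, reference, confidence
```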
Additional columns that you can include:
ACTIVE
If you set the value to no, the rule is not considered during assignment. During development, you might want to disable certain rules without removing them from the CSV file.
GROUP
A group of rules that allows you to set up more complex assignment rules, such as: If a column name contains X and its description contains Y, then assign terms T1 and T2. At least one condition and one action must be defined per rule group.
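One way to read rule groups is that all conditions of a group must match before all actions of that group are applied. The following Python sketch encodes that reading. It is an assumption based on the example sentence above, not a description of the product's internals. The function names are hypothetical, only the contains match type is handled, and property values are treated as plain strings.

```python
from collections import defaultdict


def contains(value: str, needle: str) -> bool:
    """Case-insensitive substring test, standing in for MATCH_TYPE contains."""
    return needle.casefold() in value.casefold()


def evaluate_groups(rows: list[dict[str, str]], column: dict[str, str]) -> list[tuple[str, float]]:
    """Return the (term name, confidence) pairs that the grouped rows assign to one column."""
    groups = defaultdict(lambda: {"conditions": [], "actions": []})
    for row in rows:
        group = groups[row["GROUP"]]
        if row.get("PROPERTY"):   # the row contributes a condition
            group["conditions"].append((row["PROPERTY"], row["MATCH_STRING"]))
        if row.get("TERM_NAME"):  # the row contributes an action
            group["actions"].append((row["TERM_NAME"], float(row.get("CONFIDENCE") or 1.0)))
    assigned = []
    for group in groups.values():
        conditions = group["conditions"]
        # Every condition of the group must hold for the column.
        if conditions and all(contains(column.get(prop, ""), text) for prop, text in conditions):
            assigned.extend(group["actions"])
    return assigned
```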
Rule file options
In the description field of the uploaded rule file, you can supply additional options that influence how rules are applied. Add lines in the format <option-name>=<option-value>. The description field can contain any other text as well.
default_confidence_if_missing
A float value between 0 and 1 that indicates a default confidence other than 1.0 if the CONFIDENCE column is empty.
use_expanded_names
Defines when a generated name should also be considered when rules are evaluated. This option is valid only if gen AI based enrichment capabilities are enabled in IBM Knowledge Catalog Standard or IBM Knowledge Catalog Premium. Possible values:
- NEVER: Do not consider generated names.
- SUGGESTED: Consider a suggested generated name.
- ACCEPTED: Consider an assigned generated name.
Default value is ACCEPTED.
use_generated_descriptions
Defines when a generated description should also be considered as a description when rules are evaluated. This option is valid only if gen AI based enrichment capabilities are enabled in IBM Knowledge Catalog Standard or IBM Knowledge Catalog Premium. Possible values:
- NEVER: Do not consider generated descriptions.
- SUGGESTED: Consider a suggested generated description.
- ACCEPTED: Consider an assigned generated description.
Default value is ACCEPTED.
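Because option lines share the description field with free text, a reader of that field has to pick out the lines that name a known option. The following Python sketch shows one way to do that. The function name parse_rule_file_options is hypothetical, and tolerating spaces around the equal sign is an assumption that is based on the sample description in the Examples section.

```python
KNOWN_OPTIONS = {
    "default_confidence_if_missing",
    "use_expanded_names",
    "use_generated_descriptions",
}


def parse_rule_file_options(description: str) -> dict[str, str]:
    """Collect <option-name>=<option-value> lines and skip all other text."""
    options = {}
    for line in description.splitlines():
        name, separator, value = line.partition("=")
        # Lines without '=' or with an unknown option name are treated as free text.
        if separator and name.strip() in KNOWN_OPTIONS:
            options[name.strip()] = value.strip()
    return options
```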
Examples
Rule examples
The following example describes three rules:
- If a column has a name that contains the string address, assign term personal data with 100% confidence. 100% is the default if the CONFIDENCE column is empty.
- If a column has a name that contains the string customer, assign term data subject with 90% confidence.
- If an asset has a description that contains the string client, also assign term data subject, but with 100% confidence.
The term names are written as a path in the category tree: GDPR is a root category that contains the terms personal data and data subject.
The COMMENT column contains additional information about the rule but does not affect term assignment.
OBJECT_TYPE | PROPERTY | MATCH_TYPE | MATCH_STRING | TERM_NAME | CONFIDENCE | COMMENT |
---|---|---|---|---|---|---|
column | name | contains | address | GDPR >> personal data | | Address is personal data |
column | name | contains | customer | GDPR >> data subject | 0.9 | Customers are data subjects |
asset | description | contains | client | GDPR >> data subject | | Clients are data subjects |
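The three rules in the table can also be written to a rule file with a script. The following Python sketch is one possible way to do that with the standard csv module. The function name write_rules is hypothetical, and the rows are copied from the table above.

```python
import csv

FIELDS = ["OBJECT_TYPE", "PROPERTY", "MATCH_TYPE", "MATCH_STRING", "TERM_NAME", "CONFIDENCE", "COMMENT"]
RULES = [
    ["column", "name", "contains", "address", "GDPR >> personal data", "", "Address is personal data"],
    ["column", "name", "contains", "customer", "GDPR >> data subject", "0.9", "Customers are data subjects"],
    ["asset", "description", "contains", "client", "GDPR >> data subject", "", "Clients are data subjects"],
]


def write_rules(handle) -> None:
    """Write the header row and the three example rules to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(FIELDS)
    writer.writerows(RULES)


# Usage: open the file as UTF-8 and let the csv module control the line endings.
# with open("ikc-term-assignment-rules.csv", "w", newline="", encoding="utf-8") as handle:
#     write_rules(handle)
```

The empty strings in the CONFIDENCE position produce two consecutive commas in the file, so the default confidence applies to those rules.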
Rule group example
The following example shows a rule group G1 that joins two conditions and a rule group G2 that defines two terms to be assigned for one condition:
- G1: If a column's name contains address and its description contains identifier, then assign term online identifier with confidence 92%.
- G2: If a column has postfach ("P.O. Box" in German) as one of its most frequent values, then assign term European Union with 90% confidence and term data subject with 95% confidence.
OBJECT_TYPE | PROPERTY | MATCH_TYPE | MATCH_STRING | TERM_NAME | CONFIDENCE | GROUP |
---|---|---|---|---|---|---|
column | name | contains | address | | | G1 |
column | description | contains | identifier | GDPR >> online identifier | 0.92 | G1 |
column | mostfreqvalues | contains | postfach | GDPR >> European Union | 0.9 | G2 |
| | | | GDPR >> data subject | 0.95 | G2 |
Sample rule file description
The following example is a valid rule file description:
This is the best rule file in the world.
default_confidence_if_missing = 0.95
use_expanded_names = ACCEPTED
use_generated_descriptions = SUGGESTED
Closing remarks.