Designing metadata imports
When you import metadata, you must decide what type of metadata to import, the import target and scope, whether to schedule import jobs, and how you want to customize the import behavior.
- Import goals
- Import target
- Data source
- Scope of import
- Scheduling options
- Lineage import phases
- Advanced import options
Import goals
The first step when you import metadata is to define the import goals. You must decide which type of metadata to import and whether you want to work with the imported assets in a project or publish them directly into a catalog.
Typically, metadata import is part of a larger data curation plan. For example, after you import metadata for data assets, you can add business metadata to your imported data assets by running metadata enrichment. You can also run data quality rules. Finally, you can publish the completed data assets to a catalog to share with your organization. Before you design your metadata import, make sure that you understand the implications of your choices for your overall curation plan. See Planning for curation.
For example, a typical curation process for data assets includes the following tasks:
- Run metadata import with the Import asset metadata option to add data assets to a project.
- Run metadata enrichment on the data assets to profile your data, to do basic data quality analysis, and to provide business context through term assignment.
- Run data quality rules on the assets.
- Publish the assets to a catalog.
- Run metadata import for the same data assets with the Import lineage metadata option to add lineage information to those assets in the catalog.
You can add other types of assets directly to a catalog because metadata enrichment and data quality assessment are not applicable. You can choose both Import asset metadata and Import lineage metadata options to simultaneously import technical and lineage metadata for assets while you add those assets to a catalog.
You can choose from the following import methods:
- Import asset metadata
- Asset technical metadata provides information for asset details, relationships, and the preview of assets. You can either add it to a project for further processing, or you can publish it in a catalog immediately after the import.
- Import lineage metadata
- Lineage metadata provides information about the flow of data, where it comes from, how it changes, and where it moves over time. Lineage metadata is stored in the lineage repository.
Before you can import lineage metadata, you must configure data lineage. For more information, see Configuring data lineage.
Import target
You can import metadata into the project that you're working in or to any catalog where you have an editor or admin role.
Projects
In projects, you can run metadata enrichment and data quality rules on data assets. You publish the imported data assets to a catalog after you are satisfied with their business metadata assignments and data quality.
Lineage information is available in catalogs and projects. In projects, it is available only for assets whose lineage was imported by using metadata import.
If your project is marked as sensitive, you can import metadata only to the project, not to a catalog. For more information, see Marking a project as sensitive.
Catalogs
If you know the contents of the data assets well, and you do not want to run metadata enrichment or data quality rules, you can import their metadata directly into the catalog. After the import is completed, assets are publicly available in the selected catalog.
You can import metadata to any catalog for which you have an editor or admin role, except when the catalog is a part of a project that is marked as sensitive.
If you import to a catalog, make sure that duplicate asset handling in the target catalog is set to update the original assets rather than to allow duplicate assets. See Duplicate asset handling.
If you want data protection rules to be enforced on the imported data assets, you must select a governed catalog as the import target.
Data source
For the list of supported data sources, see Supported data sources for curation and data quality.
To connect to the data source, you must specify the following details:
- Data source definition. It is required when you import lineage metadata, and optional when you import asset metadata. It is used to uniquely identify a data source by using endpoints. Endpoints include information such as the hostname or IP address, the port number, and the database name or instance identifier. For example, when you have several Microsoft SQL Server databases, data source definition identifies one of them. Or when your Teradata cluster contains several nodes with various hostnames, data source definition identifies the whole cluster as one entity. For more information, see Creating a data source definition.
- Scanner. It is used to extract and process metadata to create lineage. You select a scanner when the data source from which the lineage is imported can host metadata of multiple technologies. For example, Microsoft SQL Server can be used as a metadata storage for Microsoft SQL Server Integration Services. In such case, lineage metadata can be imported from the database (Microsoft SQL Server) or from ETL jobs (Microsoft SQL Server Integration Services). You select a scanner to import the specific type of lineage metadata.
- Connection. Connection details include credentials. You can create many connections for one data source, for example to connect by using different hostnames, or to connect to various user accounts with specific privileges. Details required to connect to a specific data source are described in each connection topic in the Connectors section. When you import asset metadata, you must select either a data source definition or a connection.
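The relationship between endpoints and a data source definition can be pictured with a minimal Python sketch. The class and field names (Endpoint, DataSourceDefinition, matches) and the hostnames are hypothetical and are used only for illustration; they do not reflect the product's API or data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """One way to reach a data source (hypothetical model)."""
    host: str       # hostname or IP address
    port: int
    database: str   # database name or instance identifier


@dataclass
class DataSourceDefinition:
    """Groups the endpoints that all refer to the same data source."""
    name: str
    endpoints: set = field(default_factory=set)

    def matches(self, endpoint: Endpoint) -> bool:
        # An endpoint belongs to this definition if it is listed in it.
        return endpoint in self.endpoints


# A Teradata cluster whose nodes have different hostnames is one entity.
teradata = DataSourceDefinition(
    name="teradata-cluster",
    endpoints={
        Endpoint("td-node1.example.com", 1025, "sales"),
        Endpoint("td-node2.example.com", 1025, "sales"),
    },
)

via_node2 = Endpoint("td-node2.example.com", 1025, "sales")
other_db = Endpoint("td-node2.example.com", 1025, "hr")

print(teradata.matches(via_node2))  # same cluster, different node
print(teradata.matches(other_db))   # different database, not this source
```

In this sketch, two connections that use different hostnames still resolve to the same definition, which is the property that lets lineage from both be attributed to one data source.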
Scope of import
Decide what scope of data you want to import. Depending on the size and contents of your data source, you might not want to import all assets but a selected subset. You can include complete schemas or folders, or drill down to individual tables or files. When you select a schema or a folder, you can immediately see how many items it contains. Thus, you can decide whether you want to include the whole set or whether a subset serves your purpose better.
You can't import data from schemas where the name contains special characters.
Inclusion and exclusion lists for lineage metadata
When you define a scope to extract lineage metadata, you can add a list of assets to include in extraction or exclude from extraction. This list is usually a regular expression and its format is specific to the selected data source. For details, see a specific connection topic in the Connectors section.
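As a rough illustration of how inclusion and exclusion patterns interact, the following Python sketch filters a list of asset names. The function name, the asset names, and the full-match semantics are assumptions made for this example; the actual list format and matching rules depend on the data source and are described in the relevant connection topic.

```python
import re


def select_assets(names, include=None, exclude=None):
    """Keep names that match `include` (if given) and do not match `exclude`.

    Assumption: a pattern must match the whole name (re.fullmatch).
    """
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    selected = []
    for name in names:
        if inc and not inc.fullmatch(name):
            continue  # not on the inclusion list
        if exc and exc.fullmatch(name):
            continue  # explicitly excluded
        selected.append(name)
    return selected


# Hypothetical asset names in schema.table form.
assets = ["SALES.ORDERS", "SALES.ORDERS_TMP", "HR.EMPLOYEES", "SALES.CUSTOMERS"]

# Include everything in the SALES schema, but skip temporary tables.
selected = select_assets(assets, include=r"SALES\..*", exclude=r".*_TMP")
print(selected)
```

Testing a pattern against a sample of asset names in this way before you run the import can help you avoid extracting more, or less, than you intended.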
External inputs
When you import lineage metadata, you can provide additional manual inputs for some data sources so that the final lineage contains more complete data. You have the following options:
- Add inputs from file
- You usually add a .zip file with a structure that meets the requirements of a specific data source. The structure requirements are explained in detail in each connection topic in the Connectors section.
- Ingest metadata from external agents
- You can connect manually to an agent file system or to a Git repository. Assets are then downloaded and used in the metadata extraction.
Placeholder replacements
When you add external inputs for lineage, you can replace placeholder values such as environment variables with real values to use for lineage analysis. The following table contains examples of how the display of data can be modified for lineage analysis.
| Replacement scope | Scope processing format | Placeholder value | Replacement value |
|---|---|---|---|
| | (Regular expression is not selected, plain text is used) | ${table_name} | customers |
| .*bteq | Regular expression | ${db} | dwh |
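The effect of the two example rows can be sketched in Python as follows. This is only an illustration of the idea, not the product's implementation: the function name, the rule tuple layout, the file names, and the choice to compare a plain-text scope by equality are all assumptions.

```python
import re


def apply_replacements(script_path, text, rules):
    """Replace placeholders in `text` for rules whose scope covers `script_path`.

    Each rule is (scope, scope_is_regex, placeholder, replacement).
    Assumption: an empty scope applies to every input.
    """
    for scope, scope_is_regex, placeholder, replacement in rules:
        if not scope:
            in_scope = True
        elif scope_is_regex:
            in_scope = re.fullmatch(scope, script_path) is not None
        else:
            in_scope = scope == script_path  # assumed plain-text comparison
        if in_scope:
            text = text.replace(placeholder, replacement)
    return text


# The two rows from the table.
rules = [
    ("", False, "${table_name}", "customers"),
    (r".*bteq", True, "${db}", "dwh"),
]

sql = "SELECT * FROM ${db}.${table_name};"
in_bteq = apply_replacements("load_customers.bteq", sql, rules)
in_sql = apply_replacements("load_customers.sql", sql, rules)
print(in_bteq)
print(in_sql)
```

In this sketch, the ${db} placeholder is resolved only for inputs whose path matches .*bteq, while ${table_name} is resolved everywhere because its scope is empty.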
Another way to provide placeholder replacements is by creating a CSV file and adding it to the .zip file that you upload as an external input. This file must be named replace.csv and it must have the following structure:
"PLACEHOLDER","REPLACEMENT_VALUE"[,SCOPE]
Where:
- PLACEHOLDER is the value that you want to replace.
- REPLACEMENT_VALUE is the new value that replaces the original value.
- SCOPE is a filter to apply the replacement only on the selected assets. This column is optional. It is interpreted as a regular expression. The example path that can be used in this file is \MyBD\MySchema\MyScript.sql.
Each replacement pair must be placed on a separate line. Each value must be enclosed in double quotation marks ("").
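A replace.csv file in this format can be generated with a short script. The following Python sketch is one possible approach; it assumes that the file has no header row and that it sits at the root of the .zip archive. Neither assumption is confirmed here, and the rest of the archive structure depends on the data source. The function name and the sample values are hypothetical.

```python
import csv
import io
import zipfile


def build_replace_csv(rows):
    """Render rows as CSV text with every value in double quotation marks.

    Each row is (placeholder, replacement) or (placeholder, replacement, scope).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow(row)  # one replacement pair per line
    return buffer.getvalue()


rows = [
    ("${table_name}", "customers"),   # no scope: two columns
    ("${db}", "dwh", r".*bteq"),      # optional scope as a regular expression
]
csv_text = build_replace_csv(rows)
print(csv_text)

# Add replace.csv to an in-memory .zip archive (assumed location: archive root).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("replace.csv", csv_text)
```

Using csv.QUOTE_ALL keeps every value enclosed in double quotation marks, and it also escapes any quotation marks that appear inside a value.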
Scheduling options
If you don't set a schedule, you run the import when you initially save the metadata import asset. You can rerun the import manually at any time.
If you select to run the import on a specific schedule, define the date and time you want the job to run. You can schedule single and recurring runs. If you schedule a single run, the job runs exactly one time at the specified day and time. If you schedule recurring runs, the job runs for the first time at the timestamp that is indicated in the Recurrence section. You might want to coordinate scheduled metadata import and the corresponding metadata enrichment jobs for the same assets.
The default name of the import job is metadata_import_name job. When you set up the metadata import, you can change the name to fit your naming schema. However, you can't change the name later. You can access the import job that you create from within the metadata import asset or from the project's Jobs page. See Jobs.
You can update the schedule of a metadata import by editing the metadata import asset.
Lineage import phases
Lineage metadata import is a process that has various phases. To optimize the import for your needs, you can decide which phases to run with each metadata import job. For example, to improve performance, you can run only the extraction phase on the selected connections that were refreshed recently. After this phase is completed, you can run the analysis on everything: the refreshed connections and those that were previously extracted.
The following list provides a brief explanation about what processes are run in each lineage import phase:
- Dictionary extraction
- Extracts and imports lineage assets (tables, views, synonyms, and others) into the lineage repository.
- Transformations extraction
- Extracts definitions of transformations from the data source.
- Extracted inputs analysis
- Analyzes data lineage for automatically extracted transformations.
- External inputs ingestion
- Ingests external inputs from an agent file system or a Git repository.
- External input analysis
- Analyzes data lineage for external inputs that were ingested or uploaded by a metadata import job.
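The example of extracting only refreshed connections and then analyzing everything can be sketched in Python as a simple run plan. The enum, the grouping of phases into EXTRACTION and ANALYSIS, the function name, and the connection names are assumptions for illustration only; which phases you can combine in one job depends on the product and the data source.

```python
from enum import Enum


class Phase(Enum):
    DICTIONARY_EXTRACTION = "Dictionary extraction"
    TRANSFORMATIONS_EXTRACTION = "Transformations extraction"
    EXTRACTED_INPUTS_ANALYSIS = "Extracted inputs analysis"
    EXTERNAL_INPUTS_INGESTION = "External inputs ingestion"
    EXTERNAL_INPUT_ANALYSIS = "External input analysis"


# Assumed grouping of the phases for this example.
EXTRACTION = frozenset({Phase.DICTIONARY_EXTRACTION, Phase.TRANSFORMATIONS_EXTRACTION})
ANALYSIS = frozenset({Phase.EXTRACTED_INPUTS_ANALYSIS, Phase.EXTERNAL_INPUT_ANALYSIS})


def plan_runs(refreshed, all_connections):
    """Extract only the refreshed connections, then analyze every connection."""
    runs = [(conn, EXTRACTION) for conn in refreshed]
    runs += [(conn, ANALYSIS) for conn in all_connections]
    return runs


# Hypothetical connection names.
runs = plan_runs(refreshed=["db2_sales"], all_connections=["db2_sales", "oracle_hr"])
for connection, phases in runs:
    print(connection, sorted(p.value for p in phases))
```

The plan skips extraction for oracle_hr because its previously extracted metadata is reused, which is where the performance gain in the example comes from.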
Advanced import options
You can customize the general import behavior and what happens to imported assets when you rerun a metadata import.
Import asset metadata options
- Prevent specific properties from being updated
- By default, all asset properties are updated when assets are reimported. If you don't want the asset names, asset descriptions, or any column descriptions to be updated on reimport, clear the respective checkboxes on the Update on reimport list.
- Delete existing assets that are not included in the reimport
- By default, no assets are deleted from the target project or catalog when you rerun the import. To clean up the target project or catalog, select from the Delete on reimport options.
- Asset not found in the data source or excluded from import: In these cases, delete previously imported assets from the import target when the import is rerun:
- The asset is no longer available in the data source.
- The Exclude from import setting changed for the rerun, so that the asset is now excluded from import (applicable only for metadata imports that you run on relational databases).
- Asset removed from the import scope: When the import is rerun, delete from the import target any assets that were removed from the scope of this metadata import after the last run.
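The Delete on reimport options can be thought of as set operations over the previously imported assets. The following Python sketch illustrates that reasoning under simplifying assumptions; the function, its parameters, and the asset names are hypothetical and do not represent the product's implementation.

```python
def assets_to_delete(previous, in_source, in_scope, excluded,
                     delete_not_found_or_excluded=False,
                     delete_removed_from_scope=False):
    """Return previously imported assets that a rerun would delete."""
    doomed = set()
    if delete_not_found_or_excluded:
        # No longer in the data source, or now excluded from import.
        doomed |= {a for a in previous if a not in in_source or a in excluded}
    if delete_removed_from_scope:
        # Removed from the import scope after the last run.
        doomed |= {a for a in previous if a not in in_scope}
    return doomed


previous = {"ORDERS", "CUSTOMERS", "V_SALES", "AUDIT_LOG"}
in_source = {"ORDERS", "CUSTOMERS", "V_SALES"}   # AUDIT_LOG was dropped
in_scope = {"ORDERS", "V_SALES", "AUDIT_LOG"}    # CUSTOMERS left the scope
excluded = {"V_SALES"}                           # views are now excluded

default = assets_to_delete(previous, in_source, in_scope, excluded)
first_only = assets_to_delete(previous, in_source, in_scope, excluded,
                              delete_not_found_or_excluded=True)
second_only = assets_to_delete(previous, in_source, in_scope, excluded,
                               delete_removed_from_scope=True)
print(default, first_only, second_only)
```

With both options cleared, the result is empty, which mirrors the default behavior of deleting nothing on reimport.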
- Do not import specific types of relational assets
- For metadata imports that you run on relational databases, you can use the Exclude from import setting to choose whether to import all types of relational assets, to exclude tables, or to exclude views, aliases, and synonyms. These options are mutually exclusive.
- Import additional asset properties
- For metadata imports that you run on relational databases, you can select whether primary and foreign keys that might be defined in the database are imported.
- Enable additional import options
- Enable incremental imports to import only new or modified data assets when you rerun the import. This option is available only for metadata imports that you run on relational databases and where the selected data source supports incremental imports:
- Amazon RDS for Oracle
- IBM Db2
- IBM Db2 Big SQL
- IBM Db2 on Cloud
- IBM Netezza Performance Server
- IBM Data Virtualization
- Microsoft Azure SQL Database
- Microsoft SQL Server
- Oracle
- Teradata
Updating or removing the description of an asset in the data source does not change the asset's modification date. The modification date also doesn't change for assets that are removed from the list of imported assets. Therefore, such assets are not considered for incremental imports. In addition, assets that are deleted from the data source or from the scope are not detected with incremental imports. Thus, such assets are not marked as Removed or deleted as specified with the Delete on reimport settings. To see such changes reflected, disable incremental imports to reimport all assets in the data scope.
Important: Incremental imports might not work if the data source and the Cloud Pak for Data client workstation are in different time zones. If the client is in a time zone that is ahead of the data source's time zone, the metadata import job might not detect assets that were added or modified after the last import run. In this case, disable incremental import so that all assets are included when you rerun the import.
For incremental imports to work, the data source must be in the GMT time zone regardless of the client's time zone.
- Collect metadata from database catalog
- For metadata imports that you run on relational databases, you can choose to import metadata from the database catalog. Thus, the user who runs the import needs access only to the database catalog but doesn't need to have SELECT permission on the actual data. The imported assets cannot be profiled or used in metadata enrichment.
- Import asset timestamp
- You can include the information about the time when the asset was last modified. The metadata_modification_token attribute is added to the extended_metadata property of an asset.
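The limits of incremental imports that are described for the Enable additional import options setting follow from selecting assets by modification date alone. The following Python sketch illustrates that reasoning; the function, the asset names, and the timestamps are hypothetical, and the sketch is not a description of how the product detects changes internally.

```python
from datetime import datetime


def incremental_candidates(source_assets, last_run):
    """Return names of assets modified after the last import run."""
    return {name for name, modified in source_assets.items() if modified > last_run}


last_run = datetime(2024, 3, 1, 8, 0)

# Current state of the data source: asset name -> last modification time.
source_assets = {
    "ORDERS": datetime(2024, 3, 2, 9, 30),      # altered after the last run
    "CUSTOMERS": datetime(2024, 2, 20, 14, 0),  # only its description changed since
    "NEW_TABLE": datetime(2024, 3, 3, 11, 0),   # created after the last run
    # "AUDIT_LOG" was dropped from the source, so it no longer appears here.
}
previously_imported = {"ORDERS", "CUSTOMERS", "AUDIT_LOG"}

picked_up = incremental_candidates(source_assets, last_run)
never_examined = previously_imported - picked_up
print(sorted(picked_up))
print(sorted(never_examined))
```

In this sketch, CUSTOMERS is skipped because a description change leaves its modification date untouched, and AUDIT_LOG is never examined because it is absent from the source. A full reimport with incremental imports disabled is what brings such changes to light.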
Import lineage metadata options
Advanced options for lineage depend on the data source that you select. For details, see a specific connection topic in the Connectors section.