Data governance tutorial: Consume your data
Take this tutorial to work with your high-quality, protected data after you complete the Curate high quality data and Protect your data tutorials in the Data governance use case of the data fabric trial. Your goal is to evaluate, share, shape, and analyze data in the data fabric.
The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data. As a Data Analyst, you will need to search for and find the right data, understand and trust its content, and then prepare it for other data analysts and data scientists to use.
The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial: viewing catalog assets, manually enriching assets and creating relationships, visualizing data, and filtering data to improve quality. Click the image to view a larger image.
Preview the tutorial
In this tutorial, you will complete these tasks:
- Set up the prerequisites.
- Task 1: Understand data assets.
- Task 2: Enrich assets and create relationships.
- Task 3: Add enriched data to a project.
- Task 4: Visualize the data.
- Task 5: Prepare the data for analytics and AI.
- Cleanup (Optional)
Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.
Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.
Use the video picture-in-picture
The following animated image shows how to use the video picture-in-picture and table of contents features:
Get help in the community
If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.
Set up your browser windows
For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.
Set up the prerequisites
Complete prerequisite tutorials
To preview this task, watch the video beginning at 00:39.
Complete the Curate high quality data and Protect your data tutorials:
- Curate high quality data tutorial to import and enrich data assets and publish them to a catalog.
- Protect your data tutorial to create data protection rules and masking flows to protect data.
Base, Premium, Standard: Unless otherwise noted, this information applies to all editions of IBM Knowledge Catalog.
Task 1: Understand data assets
To preview this task, watch the video beginning at 01:12.
Data assets in catalogs are much more than pointers to data. They contain information about the format and meaning of the data and statistics about the data values. Follow these steps to understand the value of data assets:
-
From the Navigation Menu, choose Catalogs > View all catalogs.
-
Open the Mortgage Approval Catalog.
-
The featured assets section shows Recently added assets, Recommended assets that AI and machine learning suggest based on your past usage and asset popularity, and Highly rated assets that catalog collaborators rated and reviewed.
-
Click Hide featured assets to close that section.
-
Search for mortgage.
-
Click MORTGAGE_APPLICANTS_TRUST to view that catalog asset. The Overview tab and the side panel provide basic information about the asset such as the description, a rating, tags, where the asset is located, business terms, data classes, and related items.
-
Click the Profile tab. The profile information helps you understand the content, the quality, and usability of the data.
-
Scroll to the right to locate the ZIP_CODE column.
-
The data class that was automatically assigned to the ZIP_CODE column is Commercial and Government Entity. Note that the automatically assigned data class may vary. Since the values are zip codes, you can easily reclassify this column. Click the drop-down list to see other possible data classes and their confidence levels. Select US Zip Code.
-
Click the Asset tab to see a preview of the data.
-
Return to the Overview tab to see more metadata about the columns. In the list of columns, search for the EMPLOYMENT_STATUS column to see the metadata including the assigned business terms.
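The confidence levels shown next to each candidate data class can be pictured as the share of column values that look like that class. The following is a minimal sketch of that idea only. It is not how IBM Knowledge Catalog computes confidence, and the pattern, function name, and sample values are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for US ZIP codes: five digits, optionally followed by
# a hyphen and four more digits (ZIP+4). Not taken from IBM Knowledge Catalog.
US_ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def match_confidence(values, pattern):
    """Return the fraction of non-empty values that fully match the pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    matches = sum(1 for v in non_empty if pattern.fullmatch(v))
    return matches / len(non_empty)


# Made-up sample values, not taken from the tutorial data set.
sample_zip_codes = ["94105", "10001-0001", "60614", "ABCDE", ""]
confidence = match_confidence(sample_zip_codes, US_ZIP_PATTERN)
print(f"US Zip Code confidence: {confidence:.0%}")
```

A column where most values match the pattern scores close to 1.0, which is the kind of signal that would justify reclassifying ZIP_CODE as US Zip Code.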
Check your progress
The following image shows the MORTGAGE_APPLICANTS_TRUST asset in the catalog. You explored the type of information that IBM Knowledge Catalog automatically adds to data assets during metadata enrichment. In the next task, you will manually enrich this data asset.
Task 2: Enrich assets and create relationships
To preview this task, watch the video beginning at 02:49.
You can make assets more valuable by adding information to them. For example, you can add your opinion of the asset, update asset properties, and create relationships to link assets. Follow these steps to enrich assets and create relationships:
-
For the MORTGAGE_APPLICANTS_TRUST catalog asset, click the Review tab. Rate and comment on this asset so that others can find the asset easily.
-
Select 5 stars for the rating.
-
For the review, copy and paste the following text:
This contains high quality customer data from the mortgage system.
-
Click Submit.
-
-
Click the Overview tab.
-
Click the Edit icon next to the asset name to edit the asset name.
-
Change the name to:
MORTGAGE_APPLICANTS_TRUST_PROTECT
-
Click Apply.
-
-
In the Description section in the right side panel, click the Add icon.
Note: If this asset has an existing description, you will see an Edit icon instead of an Add icon.
-
Copy and paste the following description:
Mortgage applicants from the Mortgage System
-
Click Apply.
-
-
Because this asset relates to mortgage loans, next to Business terms, click the Add icon or the Edit icon.
-
In the Search field, type loan.
Note: It is not necessary to press Enter after typing the search term. You will see a list of results immediately after typing the search term.
-
Select Loan.
-
Click Save.
-
-
Because this asset contains personal information, next to Classifications, click the Add icon or the Edit icon.
-
Select Personally Identifiable Information.
-
Click Save.
-
-
Because this asset is related to other mortgage assets, next to Related items, click Add related items > Add related assets.
-
Select Is related to, and click Next.
-
Select the CREDIT_SCORE and MORTGAGE_APPLICATION assets, and click Add.
-
-
Click MORTGAGE_APPLICATION to view that related asset.
Check your progress
The following image shows the Overview tab for the MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the catalog. You made this asset more valuable by reviewing it, updating its properties, and adding relationships to other assets. In the next task, you will add the enriched asset to a project.
Task 3: Add enriched data to a project
To preview this task, watch the video beginning at 04:09.
The data analyst team needs the mortgage applicants data in the mortgage analysis project to refine, visualize, analyze, and use as training data for models. Follow these steps to add the enriched data to a project:
-
Click Mortgage Approval Catalog in the navigation trail.
-
At the end of the MORTGAGE_APPLICANTS_TRUST_PROTECT catalog asset row, click the Overflow menu, and choose Add to project.
-
In the Target drop-down list, select the Data governance project.
-
Click Add.
-
-
When the notification displays, click Go to project. If you miss the notification, then:
-
From the Navigation Menu, choose Projects > View all projects.
-
Click the Data governance project.
-
-
In the project, click the Assets tab to see the MORTGAGE_APPLICANTS_TRUST_PROTECT data asset.
Check your progress
The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the project. Now you are ready to visualize the data.
Task 4: Visualize the data
To preview this task, watch the video beginning at 04:39.
You need to cleanse and refine the mortgage applicants data to get it ready for your analytical tools and models. A quick and easy way to determine how it needs to be shaped is to visualize the data in Data Refinery. The visualization is based on the first 5,000 rows of the data. Follow these steps to visualize the data:
-
Click the MORTGAGE_APPLICANTS_TRUST_PROTECT data asset to preview the data.
-
Click Prepare data to open the data asset in Data Refinery, and wait for the data to be read and processed.
-
In the About this asset panel, click the X to close the panel.
-
In the Steps panel, click the X to close the panel.
-
Click the Visualizations tab.
-
For the Column to visualize, select EMPLOYMENT_STATUS.
-
Click Visualize data. The tool selects a pie chart as the best chart type for this column, which shows the distribution of applicants by employment status. Notice the suggested chart types that are indicated by a blue dot next to bar, word cloud, and sunburst.
-
For the Chart type, select the Bubble chart type. The Bubble chart is one easy way to quickly visualize the distribution of values in a particular data set.
-
From the Chart type drop-down, select the Relationship chart type.
-
This chart type requires two columns. Select these columns:
-
For the first column, select EMPLOYMENT_STATUS.
-
Click Add another column.
-
For the second Column, select EDUCATION.
-
-
With the Relationship chart, you can select endpoints to see the relationships. For example, you can see applicants' employment status by level of education.
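The numbers behind these charts are simple counts: a pie or bubble chart shows how often each value occurs in one column, and a relationship chart shows how often pairs of values occur together in two columns. The following is a minimal sketch of those counts in plain Python. It does not reproduce Data Refinery itself. The column names and the 5,000-row sample limit come from this tutorial, while the sample rows, their values, and the function names are made up for illustration.

```python
import csv
import io
from collections import Counter
from itertools import islice

SAMPLE_LIMIT = 5000  # the tutorial states that visualizations use the first 5,000 rows

# Made-up rows standing in for the real asset; only the column names
# come from the tutorial.
SAMPLE_CSV = """EMPLOYMENT_STATUS,EDUCATION
Employed,Bachelor
Employed,Master
Self-employed,Bachelor
Retired,High School
Employed,Bachelor
"""


def load_sample(text, limit=SAMPLE_LIMIT):
    """Read at most `limit` rows from CSV text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, limit))


def distribution(rows, column):
    """Count how often each value occurs in one column."""
    return Counter(row[column] for row in rows)


def relationship(rows, first, second):
    """Count how often each pair of values occurs across two columns."""
    return Counter((row[first], row[second]) for row in rows)


rows = load_sample(SAMPLE_CSV)
status_counts = distribution(rows, "EMPLOYMENT_STATUS")
pair_counts = relationship(rows, "EMPLOYMENT_STATUS", "EDUCATION")

for status, count in status_counts.most_common():
    print(f"{status}: {count / len(rows):.0%}")
for (status, education), count in pair_counts.most_common():
    print(f"{status} -> {education}: {count}")
```

Each pair count corresponds to the weight of one link between an employment status endpoint and an education endpoint.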
Check your progress
The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT asset visualized in Data Refinery. You are now ready to cleanse the data.
Task 5: Prepare the data for analytics and AI
To preview this task, watch the video beginning at 05:59.
You can't process applicants without a social security number, so you need to review the data and remove any applicants without social security numbers. To prepare the MORTGAGE_APPLICANTS_TRUST_PROTECT data, you will:
- View the frequency of values in the Social_Security_Number column.
- Filter the applicants with missing values from the Social_Security_Number column.
Follow these steps to prepare the data:
-
In Data Refinery, click the Profile tab.
-
Scroll to the right to locate the Social_Security_Number column. Notice several missing values.
-
Click the Data tab to filter out these records. In the status bar at the bottom of the screen, Data Refinery indicates that the FULL DATA SET is 1101 rows.
-
If the Steps panel is not visible, click Steps to open the panel.
-
Click New step.
-
In the Cleanse section, select Filter.
-
In the Column field, select the Social_Security_Number column.
-
In the Operator field, select Is not empty.
-
Click Apply. Notice in the status bar at the bottom of the screen, Data Refinery now indicates that the FULL DATA SET is 1000 rows because the rows with missing Social Security Numbers are filtered out. Notice that a new step displays in the Steps panel showing the Filter operation.
-
-
Click the Profile tab.
-
Scroll to the right to locate the Social_Security_Number column. Notice that the missing values are gone.
-
From the toolbar, click the Save icon.
-
From the toolbar, click the Export icon, and choose Export current data to CSV.
-
Save the MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv to a local folder.
-
Navigate to that folder, and open the CSV file. It contains 1000 rows, and no applicant is missing a social security number.
-
-
Return to Cloud Pak for Data, and click the Data governance project in the navigation trail.
-
Click All assets, and locate the new Data Refinery flow asset with the name MORTGAGE_APPLICANTS_TRUST_PROTECT_flow.
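The Filter step with the Is not empty operator can be pictured as keeping only the rows whose Social_Security_Number value is present. The following is a minimal sketch of that logic in plain Python, run against a few made-up rows rather than the real 1101-row asset. The function names are hypothetical, and treating whitespace-only values as empty is an assumption, not documented Data Refinery behavior.

```python
import csv
import io

SSN_COLUMN = "Social_Security_Number"  # column name from the tutorial

# Made-up rows for illustration; the ID column and all values are hypothetical.
SAMPLE_LINES = [
    "ID,Social_Security_Number",
    "1,000-00-0001",
    "2,",
    "3,000-00-0003",
    "4,   ",
]


def is_not_empty(value):
    """True when a value is present and not only whitespace (an assumption
    about what counts as empty)."""
    return value is not None and value.strip() != ""


def filter_not_empty(rows, column):
    """Keep only the rows that have a non-empty value in the given column."""
    return [row for row in rows if is_not_empty(row.get(column))]


def to_csv(rows, fieldnames):
    """Serialize the rows back to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


reader = csv.DictReader(io.StringIO("\n".join(SAMPLE_LINES)))
all_rows = list(reader)
kept_rows = filter_not_empty(all_rows, SSN_COLUMN)
missing = len(all_rows) - len(kept_rows)
print(f"{len(all_rows)} rows read, {missing} missing {SSN_COLUMN}, {len(kept_rows)} kept")

shaped_csv = to_csv(kept_rows, reader.fieldnames)
```

Comparing the row count before and after the filter mirrors the check in the status bar, where the full data set drops from 1101 to 1000 rows.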
Check your progress
The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv file that you refined in Data Refinery. This data set contains the information about those mortgage applicants who provided a social security number.
As a Data Analyst for Golden Bank, you learned how to search for and find the right data, understand and trust its content, and then prepare it for other data analysts and data scientists to use.
Cleanup (Optional)
If you would like to retake the tutorials in the Data governance use case, delete the following artifacts.
Artifact | How to delete |
---|---|
Imported business terms | Delete governance artifacts |
Banking category | Delete a category |
Data protection rules: Confidential Information and Redact Social Security Number | Delete data protection rules |
Mortgage Approval Catalog | Delete a catalog |
Data governance sample project | Delete a project |
Next steps
-
Try the Govern virtualized data tutorial.
-
Try the Configure a 360-degree view tutorial.
-
Sign up for another Data fabric use case.