Data integration tutorial: Orchestrate an AI pipeline with data integration
Take this tutorial to create an end-to-end pipeline that delivers concise, pre-processed, and up-to-date data stored in an external data source with the data fabric trial. Your goal is to use Orchestration Pipelines to orchestrate that end-to-end workflow to generate automated, consistent, and repeatable outcomes. The pipeline uses DataStage and AutoAI, which automates several aspects of the model-building process, such as feature engineering and hyperparameter optimization. AutoAI ranks candidate algorithms, and then selects the best model.
The story for the tutorial is that Golden Bank wants to expand its business by offering special low-rate mortgage renewals for online applications. Online applications expand the bank’s customer reach and reduce the bank’s application processing costs. The team will use Orchestration Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants, which lenders can use for decision making. The data is stored in Db2 Warehouse. You need to prepare the data because it is potentially incomplete, outdated, and might be obfuscated or entirely inaccessible due to data privacy and sovereignty policies. Then, the team needs to build a mortgage approval model from trusted data, and then deploy and test the model in a pre-production environment.
The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will edit and run a pipeline to build and deploy a machine learning model. Click the image to view a larger image.
Preview the tutorial
In this tutorial, you will complete these tasks:
- Set up the prerequisites.
- Task 1: View the assets in the sample project.
- Task 2: Explore an existing pipeline.
- Task 3: Add a node to the pipeline.
- Task 4: Run the pipeline.
- Task 5: View the assets, deployed model, and online deployment.
- Cleanup (Optional)
Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface that is shown in the video. The video is intended to be a companion to the written tutorial.
This video provides a visual method to learn the concepts and tasks in this documentation.
Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.
Use the video picture-in-picture
The following animated image shows how to use the video picture-in-picture and table of contents features:
Get help in the community
If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.
Set up your browser windows
For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.
Set up the prerequisites
Sign up for Cloud Pak for Data as a Service
You must sign up for Cloud Pak for Data as a Service and provision the necessary services for the Data integration use case.
- If you have an existing Cloud Pak for Data as a Service account, then you can get started with this tutorial. If you have a Lite plan account, only one user per account can run this tutorial.
- If you don't have a Cloud Pak for Data as a Service account yet, then sign up for a data fabric trial.
Watch the following video to learn about data fabric in Cloud Pak for Data.
This video provides a visual method to learn the concepts and tasks in this documentation.
Verify the necessary provisioned services
To preview this task, watch the video beginning at 00:37.
Follow these steps to verify or provision the necessary services:
1. From the Navigation Menu, choose Services > Service instances.
2. Use the Product drop-down list to determine whether an existing Watson Studio service instance exists.
3. If you need to create a Watson Studio service instance, click Add service.
   1. Select Watson Studio.
   2. Select the Lite plan.
   3. Click Create.
4. Wait while the Watson Studio service is provisioned, which might take a few minutes to complete.
5. Repeat these steps to verify or provision the following additional services:
   - Watson Machine Learning
   - DataStage
   - Cloud Object Storage
Check your progress
The following image shows the provisioned service instances:
Create the sample project
To preview this task, watch the video beginning at 01:14.
If you already have the sample project for this tutorial, then skip this task. Otherwise, follow these steps:
1. Access the Orchestrate an AI pipeline sample project in the Resource hub.
2. Click Create project.
3. If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.
4. Click Create.
5. Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.
6. Click the Assets tab to see the connection, the DataStage flows, the data definition, and the pipeline.
Check your progress
The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.
Associate the Watson Machine Learning service with the sample project
To preview this task, watch the video beginning at 02:04.
You will use Watson Machine Learning to create and deploy the model, so follow these steps to associate your Watson Machine Learning service instance with the sample project.
1. In the Orchestrate an AI pipeline project, click the Manage tab.
2. Click the Services & Integrations page.
3. Click Associate service.
4. Check the box next to your Watson Machine Learning service instance.
5. Click Associate.
6. Click Cancel to return to the Services & Integrations page.
Check your progress
The following image shows the Services & Integrations page with the Watson Machine Learning service listed. You are now ready to view the assets in the sample project.
Task 1: View the assets in the sample project
To preview this task, watch the video beginning at 02:26.
The sample project includes several assets including a connection, data definition, two DataStage flows, and a pipeline. Follow these steps to view those assets:
1. Click the Assets tab in the Orchestrate an AI pipeline project, and then view All assets.
2. All of the data assets that are used in the DataStage flows and the pipeline are stored in the Data Fabric Trial - Db2 Warehouse connection in the AI_MORTGAGE schema. The following image shows the assets from that connection:
3. The Integrate Mortgage Data DataStage flow integrates data about each mortgage applicant, including personally identifiable information, with their application details, credit scores, status as a commercial buyer, and finally the prices of each applicant’s chosen home. The flow then creates a sequential file named `Mortgage_Data.csv` in the project that contains the joined data. The following image shows the Integrate Mortgage Data DataStage flow.
   Tip: If you don't see any DataStage flows, then go back to your service instances to verify that your DataStage instance provisioned successfully. See Verify the necessary provisioned services.
4. The Integrate Mortgage Approvals DataStage flow uses the output from the first DataStage flow (`Mortgage_Data.csv`) and further enriches the data by integrating information about each mortgage application approval. The resulting data set is saved to the project with the name `Mortgage_Data_with_Approvals.csv`. The following image shows the Integrate Mortgage Approvals DataStage flow:
5. The Definition_Mortgage_Data data definition for the `Mortgage_Data_with_Approvals.csv` data asset is created by the Integrate Mortgage Approvals DataStage flow. The following image shows the data definition:
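The DataStage flows are built visually, but the joins they perform can be illustrated with a short pure-Python sketch. The table names, column subsets, and values below are hypothetical placeholders, not the actual AI_MORTGAGE tables:

```python
# Hypothetical stand-ins for tables in the AI_MORTGAGE schema; the real
# DataStage flows read these from the Db2 Warehouse connection.
applicants = {1: {"NAME": "Ada Smith", "INCOME": 144306},
              2: {"NAME": "Ben Jones", "INCOME": 45283}}
credit_scores = {1: {"CREDIT_SCORE": 437}, 2: {"CREDIT_SCORE": 706}}
approvals = {1: {"MORTGAGE_APPROVAL": False}, 2: {"MORTGAGE_APPROVAL": True}}

def join(left: dict, right: dict) -> dict:
    # Inner join on the applicant ID, merging the column dictionaries.
    return {k: {**left[k], **right[k]} for k in left if k in right}

# Roughly what Integrate Mortgage Data does: join applicant details with
# credit data into one record per applicant (Mortgage_Data.csv).
mortgage_data = join(applicants, credit_scores)

# Roughly what Integrate Mortgage Approvals does: enrich that output with
# the approval label (Mortgage_Data_with_Approvals.csv).
mortgage_data_with_approvals = join(mortgage_data, approvals)

print(mortgage_data_with_approvals[2])
# {'NAME': 'Ben Jones', 'INCOME': 45283, 'CREDIT_SCORE': 706, 'MORTGAGE_APPROVAL': True}
```

The second join is why the flows must run in sequence: the approvals flow consumes the file that the first flow writes.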
Check your progress
The following image shows all of the assets in the sample project. You are now ready to explore the pipeline in the sample project.
Task 2: Explore an existing pipeline
To preview this task, watch the video beginning at 04:00.
The sample project includes a pipeline, created with Orchestration Pipelines, which automates the following tasks:
- Run two existing DataStage jobs.
- Create an AutoAI experiment.
- Run the AutoAI experiment, using the output file from the DataStage job as the training data, and save the best performing model.
- Create a deployment space.
- Promote the saved model to the deployment space.
Follow these steps to explore the pipeline:
1. From the Assets tab in the Orchestrate an AI pipeline project, view All assets.
2. Click Mortgage approval pipeline to open the pipeline.
3. In the beginning section of the pipeline, two DataStage jobs (Integrate Mortgage Data and Integrate Mortgage Approvals) run in sequence to combine various tables from the Db2 Warehouse on Cloud connection into a cohesive labeled data set that is used as the training data for the AutoAI experiment.
4. Double-click the Check Status node to see the condition. This condition is a decision point in the pipeline that confirms the completion of the first DataStage job with a value of either Completed or Completed with warnings. Click Cancel to return to the pipeline.
5. Double-click the Create AutoAI experiment node to see the settings. This node creates an AutoAI experiment with the specified settings.
   1. Review the values for the following settings:
      - AutoAI experiment name
      - Scope
      - Prediction type
      - Prediction column
      - Positive class
      - Training data split ratio
      - Algorithms to include
      - Algorithms to use
      - Optimize metric
   2. Click Cancel to close the settings.
6. Double-click the Run AutoAI experiment node to see the settings. This node runs the AutoAI experiment that is created by the Create AutoAI experiment node, using the output from the Integrate Mortgage Approvals DataStage job as the training data.
   1. Review the values for the following settings:
      - AutoAI experiment
      - Training Data Assets
      - Model name prefix
   2. Click Cancel to close the settings.
7. Between the Run AutoAI experiment and Create Deployment Space nodes, double-click the Do you want to deploy model? node to see the condition. The value of True for this condition is a decision point in the pipeline to continue to create the deployment space. Click Cancel to return to the pipeline.
8. Double-click the Create Deployment Space node to see the settings. This node creates a new deployment space with the specified name, and requires input for your Cloud Object Storage and Watson Machine Learning services.
   1. Review the value for the New space name setting.
   2. For the New space COS Instance CRN field, select your Cloud Object Storage instance from the list.
   3. For the New space WML Instance CRN field, select your Watson Machine Learning instance from the list.
   4. Click Save.
9. Double-click the Promote Model to Deployment Space node to see the settings. This node promotes the best model from the Run AutoAI experiment node to the deployment space that is created by the Create Deployment Space node.
   1. Review the values for the following settings:
      - Source Assets
      - Target
   2. Click Cancel to close the settings.
Check your progress
The following image shows the initial pipeline. You are now ready to edit the pipeline to add a node.
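The two condition nodes you inspected implement simple gating logic between stages. The control flow can be sketched in plain Python; the function names and the hard-coded status strings below are illustrative, not the Orchestration Pipelines API:

```python
def check_status(datastage_job_status: str) -> bool:
    # The Check Status node continues only when the first DataStage job
    # finishes with "Completed" or "Completed with warnings".
    return datastage_job_status in ("Completed", "Completed with warnings")

def should_deploy(deployment_param: bool) -> bool:
    # The "Do you want to deploy model?" node checks the deployment
    # pipeline parameter supplied at run time; True continues the flow.
    return deployment_param is True

# Simplified run: each condition gates the next stage of the pipeline.
if check_status("Completed with warnings"):
    print("Run AutoAI experiment and save the best model")
    if should_deploy(True):
        print("Create deployment space and promote the model")
```

A failed DataStage job short-circuits the whole run, which is why the condition sits between the jobs and the AutoAI experiment.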
Task 3: Add a node to the pipeline
To preview this task, watch the video beginning at 06:23.
The pipeline creates the model, creates a deployment space, and then promotes the model to that deployment space. You need to add a node to create an online deployment. Follow these steps to edit the pipeline to automate creating an online deployment:
1. Add the Create Online Deployment node to the canvas:
   1. Expand the Create section in the node palette.
   2. Drag the Create online deployment node onto the canvas, and drop the node after the Promote Model to Deployment Space node.
2. Hover over the Promote Model to Deployment Space node to see the arrow. Connect the arrow to the Create online deployment node.
   Note: The node names in your pipeline might differ from the following animated image.
3. Connect the Create online deployment for promoted model comment to the Create online deployment node by connecting the circle on the comment box to the node.
   Note: The node names in your pipeline might differ from the following animated image.
4. Double-click the Create online deployment node to see the settings.
5. Change the node name to `Create Online Deployment`.
6. Next to ML asset, click Select from another node from the menu.
7. Select the Promote Model to Deployment Space node from the list. The node ID winning_model is selected.
8. For the New deployment name, type `mortgage approval model deployment`.
9. For Creation Mode, select Overwrite.
10. Click Save to save the Create Online Deployment node settings.
Check your progress
The following image shows the completed pipeline. You are now ready to run the pipeline.
Task 4: Run the pipeline
To preview this task, watch the video beginning at 07:38.
Now that the pipeline is complete, follow these steps to run the pipeline:
1. From the toolbar, click Run pipeline > Trial run.
2. On the Define pipeline parameters page, select True for the deployment parameter.
   - If set to True, then the pipeline verifies the deployed model and scores the model.
   - If set to False, then the pipeline verifies that the model was created in the project by the AutoAI experiment, and reviews the model information and training metrics.
3. If this is your first time running a pipeline, you are prompted to provide an API key. Pipeline assets use your personal IBM Cloud API key to run operations securely without disruption.
   - If you have an existing API key, click Use existing API key, paste the API key, and click Save.
   - If you don't have an existing API key, click Generate new API key, provide a name, and click Save. Copy the API key, and then save it for future use. When you're done, click Close.
4. Click Run to start running the pipeline.
5. Scroll through the consolidated logs while the pipeline is running. The trial run might take up to 10 minutes to complete.
6. As each operation completes, select the node for that operation on the canvas.
7. On the Node Inspector tab, view the details of the operation.
8. Click the Node output tab to see a summary of the output for each node operation.
Check your progress
The following image shows the pipeline after it completed the trial run. You are now ready to review the assets that the pipeline created.
Task 5: View the assets, deployed model, and online deployment
To preview this task, watch the video beginning at 09:48.
The pipeline created several assets. Follow these steps to view the assets:
1. Click the Orchestrate an AI pipeline project name in the navigation trail to return to the project.
2. On the Assets tab, view All assets.
3. View the data assets.
   1. Click the Mortgage_Data.csv data asset. The Integrate Mortgage Data DataStage job created this asset.
   2. Click the project name in the navigation trail to return to the Assets tab.
   3. Click the Mortgage_Data_with_Approvals.csv data asset. The Integrate Mortgage Approvals DataStage job created this asset.
   4. Click the project name in the navigation trail to return to the Assets tab.
4. View the model.
   1. Click the machine learning model asset beginning with mortgage_approval_best_model. The AutoAI experiment generated several model candidates, and chose this one as the best model.
   2. Scroll through the model information.
   3. Click the project name in the navigation trail to return to the Assets tab.
5. Click the Jobs tab in the project to see information about the runs of the two DataStage jobs and the one pipeline job.
6. From the Navigation Menu, choose Deployments.
7. Click the Spaces tab.
8. Click the Mortgage approval deployment space.
9. Click the Assets tab, and see the deployed model beginning with mortgage_approval_best_model.
10. Click the Deployments tab.
11. Click mortgage approval model deployment to view the deployment.
12. View the information on the API reference tab.
13. Click the Test tab.
14. Click the JSON input tab, and replace the sample text with the following JSON text:

    ```json
    {
      "input_data": [
        {
          "fields": [ "ID", "NAME", "STREET_ADDRESS", "CITY", "STATE", "STATE_CODE", "ZIP_CODE", "EMAIL_ADDRESS", "PHONE_NUMBER", "GENDER", "SOCIAL_SECURITY_NUMBER", "EDUCATION", "EMPLOYMENT_STATUS", "MARITAL_STATUS", "INCOME", "APPLIEDONLINE", "RESIDENCE", "YRS_AT_CURRENT_ADDRESS", "YRS_WITH_CURRENT_EMPLOYER", "NUMBER_OF_CARDS", "CREDITCARD_DEBT", "LOANS", "LOAN_AMOUNT", "CREDIT_SCORE", "CRM_ID", "COMMERCIAL_CLIENT", "COMM_FRAUD_INV", "FORM_ID", "PROPERTY_CITY", "PROPERTY_STATE", "PROPERTY_VALUE", "AVG_PRICE" ],
          "values": [
            [ null, null, null, null, null, null, null, null, null, null, null, "Bachelor", "Employed", null, 144306, null, "Owner Occupier", 15, 19, 2, 7995, 1, 1483220, 437, null, false, false, null, null, null, 111563 ],
            [ null, null, null, null, null, null, null, null, null, null, null, "High School", "Employed", null, 45283, null, "Private Renting", 11, 13, 1, 1232, 1, 7638, 706, null, false, false, null, null, null, 547262 ]
          ]
        }
      ]
    }
    ```

15. Click Predict. The results show that the first applicant would not be approved and the second applicant would be approved.
Check your progress
The following image shows the results of the test.
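Besides the Test tab, you can score the online deployment programmatically through its REST endpoint. The following is a minimal stdlib sketch, not an official client: the scoring URL is a placeholder that you copy from the deployment's API reference tab, and the IAM token exchange uses the standard IBM Cloud identity endpoint.

```python
import json
import urllib.parse
import urllib.request

API_KEY = "<your IBM Cloud API key>"
SCORING_URL = "<scoring endpoint from the deployment's API reference tab>"

def auth_header(token: str) -> dict:
    # Scoring requests authenticate with an IAM bearer token.
    return {"Authorization": f"Bearer {token}",
            "Content-Type": "application/json"}

def get_iam_token(api_key: str) -> str:
    # Exchange the IBM Cloud API key for an IAM access token.
    data = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode()
    req = urllib.request.Request("https://iam.cloud.ibm.com/identity/token",
                                 data=data)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["access_token"]

def score(payload: dict) -> dict:
    # POST the same JSON body that the Test tab uses.
    token = get_iam_token(API_KEY)
    req = urllib.request.Request(SCORING_URL,
                                 data=json.dumps(payload).encode(),
                                 headers=auth_header(token))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example usage (requires real credentials and the payload from the Test tab):
# with open("scoring_payload.json") as f:
#     print(score(json.load(f)))
```

Calling the endpoint this way is useful when lenders want to embed the approval prediction in their own applications rather than using the Cloud Pak for Data UI.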
Golden Bank's team used Orchestration Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants and a machine learning model that lenders can use for decision making.
Cleanup (Optional)
If you would like to retake this tutorial, delete the following artifacts.
| Artifact | How to delete |
| --- | --- |
| Mortgage Approval Model Deployment in the Mortgage approval deployment space | Delete a deployment |
| Mortgage approval deployment space | Delete a deployment space |
| Orchestrate an AI pipeline sample project | Delete a project |
Next steps
- Try these tutorials:
- Sign up for another Data fabric use case.
Learn more
Parent topic: Use case tutorials