Generating synthetic unstructured data (beta)

Last updated: May 08, 2025

With the watsonx.ai synthetic data generation API, you can create large, high-quality unstructured text datasets that mimic your organization's real data. Use the generated synthetic datasets to tune and evaluate foundation models for your specific use case.

Note: Generating synthetic unstructured data is available as a beta feature and can be accessed programmatically through the watsonx.ai API in the Sydney and Toronto regions only.

Overview

You can use large language models (LLMs) that are trained with large datasets to generate output that is customized for your organization. However, you must tune the models with a large amount of helpful and accurate training data. A small or low-quality dataset is insufficient to successfully train models to generate output that is relevant to your specific use case.

Use the synthetic data generation API to create large unstructured text datasets by using data builder pipelines and data validators that are optimized for generating data for tuning and evaluating foundation models.

A data builder pipeline generates synthetic data in different formats that mimics the sample seed data and reference documents you provide as an input to the pipeline. Based on your use case, you can choose from the following data builder pipelines:

Tool calling
The tool calling data builder pipeline creates training datasets that can be used to train AI models to interact with external tools, application programming interfaces (APIs), or systems to enhance their capabilities.
Text to SQL
The text to SQL data builder pipeline generates synthetic SQL data triplets that contain a natural language statement describing a database operation, an equivalent SQL statement that performs the database operation, and the database schema.
Knowledge
The knowledge data pipeline generates question and answer (QnA) pairs based on examples in documents that are specific to a business domain.

For more information about seed data formats and choosing a data builder pipeline, see Data builder pipelines and seed data formats.

REST API

You can use the synthetic data generation (SDG) API to administer synthetic unstructured data generation. The synthetic data is generated with foundation models that are provided in watsonx.ai. The format of the generated data is based on sample seed data you provide and the data builder pipeline you use. After the foundation model generates the dataset, the data is validated against the data builder pipeline's quality requirements and stored in your project asset.

Note: Charges for tokens used by foundation models to generate synthetic data still apply during the beta period.

For API method details, see the watsonx.ai API reference documentation.

For more information about best practices to follow when you tune and evaluate foundation models by using the data that is generated with the API, see Best practices.

The following diagram shows the REST API workflow to generate synthetic unstructured data by providing sample seed data in a format that suits your use case.

watsonx.ai synthetic unstructured data generation API workflow

Before you begin

To generate synthetic unstructured data programmatically, you must first complete the following setup:

  1. Create a project and have the Admin or Editor role in the project. Your project must have an associated watsonx.ai Runtime service instance.

  2. Create an IBM Cloud user API key and IBM Cloud Identity and Access Management (IAM) token. For details, see Credentials for programmatic access.

  3. Create a task credential.

A task credential is an API key that authenticates the long-running jobs that are started during the synthetic data generation procedure. You do not need to pass the task credential in the API request. For details, see Creating task credentials.

  4. Optional: Choose a foundation model to use to generate synthetic datasets.

    The following models are certified for use with the Synthetic Data Generator service:

    • granite-3-8b-instruct
    • mistral-large

    The API uses the granite-3-8b-instruct model by default. For model details including billing information and API model IDs, see Supported foundation models.
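The IAM token from step 2 is obtained by exchanging your IBM Cloud API key at the IAM token endpoint. The following sketch builds that token request with the Python standard library. The endpoint URL and grant type are the documented IBM Cloud IAM values; the helper function name is our own, and `api_key` is a placeholder you supply.

```python
import json
import urllib.parse
import urllib.request

IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def build_iam_token_request(api_key: str) -> urllib.request.Request:
    """Build the POST request that exchanges an IBM Cloud API key for a bearer token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("utf-8")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# To send the request and read the bearer token (json parses the response):
# with urllib.request.urlopen(build_iam_token_request("YOUR_API_KEY")) as resp:
#     token = json.load(resp)["access_token"]
```

The returned `access_token` value is what you pass in the `Authorization: Bearer` header of subsequent API calls.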

Procedure

Follow these high-level steps to generate synthetic unstructured text data by using the REST API:

  1. Choose a data builder pipeline and upload the input seed data files to your project asset.

    The format of the sample input data depends on the data builder pipeline you select. For all data builders, you must provide seed data as an input for the data generation request. For some pipelines, you must also provide reference documents. For details, see Data builder pipelines and seed data formats.

  2. Use the Create a synthetic unstructured data generation job REST API method to create the job configuration for your synthetic data generator asset type. You must specify the following settings in your request:

    • The data builder pipeline
    • Reference to your input seed data
    • The number of QnA pairs to generate

    You can optionally specify the API model ID of a foundation model to override the default model setting.

  3. Run the synthetic unstructured data generation job.

    A job run can take a few minutes or hours to complete, depending on the volume of the generated output, the data builder pipeline, and the model. You can monitor the status of the synthetic unstructured data generation job by clicking the job run to access the log from the Job run details page.

    Attention: You incur charges for tokens that the foundation model generates. For details, see Supported foundation models.
  4. Download the generated output JSONL files that contain the synthetic unstructured data from your project's data asset. The generated data is formatted according to the data builder pipeline you specified in the API request to create the synthetic unstructured data generation job.
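The downloaded output in step 4 is JSONL: one JSON record per line. A minimal loading sketch, assuming only that each non-empty line is a standalone JSON object:

```python
import json

def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL content (one JSON object per line) into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

The exact attribute names in each record depend on the data builder pipeline you selected, as described in the output format documentation.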

Request example

For example, the following command submits a synthetic unstructured data generation request:

curl -X POST \
  'https://api.{region}.dai.cloud.ibm.com/v1/synthetic_data/generation/unstructured?version=2025-04-17' \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...' \
  --data @payload.json

The following is an example payload.json file that contains a request body that overrides the default foundation model:

{
    "project_id": "<Your project ID>",
    "name": "<Name of the job that you want to create>",
    "description": "<Description of your project>",
    "pipeline": "<Data builder pipeline>",
    "model_id": "mistralai/mistral-large",
    "parameters": {
         "num_outputs_to_generate": < A value between 1 to 1000 >,
    },
    "seed_data_reference": {
         "type": "container",
         "location": {
            "path": "<Input seed data file name in project asset>"
         }
    },
    "results_reference": {
         "type": "container",
         "location": {
            "path": "<Generated data output file name in project asset>"
         }
    }
}
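The request body can also be assembled and sanity-checked programmatically before you submit the job. This is a minimal sketch: the field names mirror the example request body above, the 1 to 1000 bound is the beta limit per request, and the helper name is our own.

```python
import json

def build_sdg_payload(project_id, name, pipeline, seed_path, results_path,
                      num_outputs, model_id=None, description=""):
    """Assemble a request body for a synthetic unstructured data generation job."""
    if not 1 <= num_outputs <= 1000:  # beta limit per request
        raise ValueError("num_outputs_to_generate must be between 1 and 1000")
    payload = {
        "project_id": project_id,
        "name": name,
        "description": description,
        "pipeline": pipeline,
        "parameters": {"num_outputs_to_generate": num_outputs},
        "seed_data_reference": {"type": "container", "location": {"path": seed_path}},
        "results_reference": {"type": "container", "location": {"path": results_path}},
    }
    if model_id:  # optional override of the default model
        payload["model_id"] = model_id
    return payload

# Serialize the body to payload.json for use with the curl command above:
# with open("payload.json", "w") as f:
#     json.dump(build_sdg_payload(...), f, indent=2)
```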

Output details

During the beta period, you can generate a maximum of 1000 QnA pairs of synthetic data with each REST API request. To generate a larger dataset, contact the support team by opening a case in the IBM Cloud Support portal. For details, see Creating support cases in the IBM Cloud documentation.
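If you need more outputs before a raised limit is in place, one workaround is to split the total across several job requests, each within the 1000-output cap. A sketch of that chunking arithmetic (the helper is our own illustration; note that overlap or duplication across batches is possible, and a support case remains the supported route for larger datasets):

```python
def split_into_requests(total: int, cap: int = 1000) -> list[int]:
    """Split a desired output count into per-request counts that respect the cap."""
    if total < 1:
        raise ValueError("total must be at least 1")
    full, rest = divmod(total, cap)
    return [cap] * full + ([rest] if rest else [])
```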

Best practices

Use the following guidelines while working with the synthetic data generation API:

  • To select the foundation model best suited for your use case, experiment by generating a small number of QnA pairs with multiple certified foundation models. Change the following setting in your API request to adjust the number of generated outputs:

    "parameters": {
      "num_outputs_to_generate": 10
    }
    

    After verifying the quality of the generated output, choose a certified foundation model, and proceed with generating larger datasets.

  • Make sure to review the synthetic unstructured data that is produced with the API before you use the data to train your models.

  • To use synthetic data to train models in the Tuning Studio, the dataset must contain input and output attributes.

    Based on the data builder pipeline you use to generate the synthetic data, complete the following steps to make your dataset compatible with Tuning Studio:

    • Tool calling pipeline: No changes required, ready to use.
    • Text to SQL pipeline: Rename the utterance attribute to input. Rename the query attribute to output.
    • Knowledge pipeline: Rename the question attribute to input. Rename the answer attribute to output.
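The renaming steps above can be sketched as a small JSONL rewrite. The source field names (utterance/query and question/answer) come from the list above; the helper itself is our own illustration.

```python
import json

# Source-to-target field mappings per data builder pipeline.
FIELD_MAPS = {
    "text_to_sql": {"utterance": "input", "query": "output"},
    "knowledge": {"question": "input", "answer": "output"},
}

def to_tuning_studio(jsonl_text: str, pipeline: str) -> str:
    """Rename pipeline-specific fields to the input/output attributes Tuning Studio expects."""
    mapping = FIELD_MAPS.get(pipeline, {})  # tool calling needs no changes
    out_lines = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        renamed = {mapping.get(key, key): value for key, value in record.items()}
        out_lines.append(json.dumps(renamed))
    return "\n".join(out_lines)
```

Any fields outside the mapping, such as a database schema in the text to SQL triplets, pass through unchanged.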

Learn more

Parent topic: Preparing data