Customizing RAG experiment settings
When you build a retrieval-augmented generation solution in AutoAI, you can customize experiment settings to tailor your results.
If you run a RAG experiment based on default settings, the AutoAI process selects:
- The optimization metric to be maximized when searching for the best RAG pipeline
- The embedding models to try, based on the available list
- The foundation models to try, based on the available list
To exercise more control over the RAG experiment, you can customize the experiment settings. After entering the required experiment definition information, click Experiment settings to customize options before running the experiment. Settings you can review or edit fall into three categories:
- Retrieval & generation: choose which metric to use to optimize the choice of RAG pattern, how much data to retrieve, and the models AutoAI can use for the experiment.
- Indexing: choose how the data is broken down into chunks, the metric used to measure semantic similarity, and which embedding model AutoAI can use for experimentation.
- Additional information: review the watsonx.ai Runtime instance and the environment to use for the experiment.
Retrieval and generation settings
View or edit the settings that are used to generate the RAG pipelines.
Optimization metric
Choose the metric to maximize when searching for the optimal RAG patterns. For more information about optimization metrics and their implementation details, see RAG metrics.
- Answer faithfulness measures how closely the generated response aligns with the context retrieved from the vector store. The score is calculated using a lexical metric that counts how many of the generated response tokens are included in the context retrieved from the vector store. A high score indicates that the response represents the retrieved context well. Note that a high faithfulness score does not necessarily indicate correctness of the response. For more information on how the metric is implemented, see Faithfulness.
- Answer correctness measures the correctness of the generated answer compared to the correct answer provided in the benchmark files. This includes the relevance of the retrieved context and the quality of the generated response. The score is calculated using a lexical metric that counts how many of the ground-truth response tokens are included in the generated response. For more information on how the metric is implemented, see Correctness.
- Context correctness indicates to what extent the context retrieved from the vector store aligns with the ground truth context provided in the benchmark. The score is calculated based on the rank of the ground truth context among the retrieved chunks. The closer the ground truth context is to the top of the list, the higher the score. For more information on how the metric is implemented, see Context correctness.
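To illustrate the token-overlap idea behind these lexical metrics, the following sketch computes a simple overlap score in the spirit of answer faithfulness. It is an illustration only, not the actual AutoAI implementation, which tokenizes and normalizes text differently (see RAG metrics):
# Illustrative only: a simplified token-overlap score in the spirit of the
# lexical faithfulness metric. The AutoAI implementation differs in how it
# tokenizes and normalizes text.
def lexical_overlap_score(response: str, context: str) -> float:
    """Return the fraction of response tokens that also appear in the context."""
    response_tokens = response.lower().split()
    context_tokens = set(context.lower().split())
    if not response_tokens:
        return 0.0
    matched = sum(1 for token in response_tokens if token in context_tokens)
    return matched / len(response_tokens)

# 5 of the 6 response tokens appear in the retrieved context, so the score is ~0.83
print(lexical_overlap_score(
    "AutoAI selects the best RAG pattern",
    "AutoAI searches for the best RAG pattern using optimization metrics",
))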
Retrieval methods
You can accept the automatically selected configuration for retrieving relevant data or edit the configuration settings. Retrieval methods differ in the ways that they filter and rank documents.
- Choose the simple retrieval method or the window retrieval method. (A sketch that illustrates the window retrieval mechanism follows this list.)
  - Simple retrieval method finds the most relevant chunks in the vector store.
  - Window retrieval method surrounds each retrieved chunk with additional chunks from before and after it in the original document. This method is useful for including context that might be missing from the originally retrieved chunk. Window retrieval works as follows:
    - Search: Finds the most relevant document chunks in the vector store.
    - Expand: For each found chunk, retrieves surrounding chunks to provide context. Each chunk stores its sequence number in its metadata, and after a chunk is retrieved, that metadata is used to fetch neighboring chunks from the same document. For example, if window_size is 2, the 2 chunks before and the 2 chunks after the retrieved chunk are added.
    - Merge: Combines overlapping text within the window to remove redundancy.
    - Metadata handling: Merges metadata dictionaries by keeping the same keys and grouping values into lists.
    - Return: Outputs the merged window as a new chunk, replacing the original one.
- Select the number of chunks, from 1 to 10. The number of chunks determines how many of the most relevant chunks are retrieved from the vector store for each query.
- If you select the window retrieval method, you can set the window size from 1 to 4. The window size is the number of adjacent chunks that are retrieved before and after each matched chunk to provide additional context.
- You can optionally select a hybrid strategy to improve the output quality. A hybrid strategy combines sparse and dense embedding vectors to perform a similarity search in the vector database. Sparse embeddings prioritize exact keyword matches, and dense embeddings prioritize outputs that have semantic similarity. Combining sparse and dense embeddings improves search accuracy and relevance, resulting in more comprehensive information retrieval from the database. This setting is not available for the in-memory Chroma vector database. If you're using the Elasticsearch vector database, you need to install the ELSER model.
Choose from one of these hybrid strategy options:
- RRF (Reciprocal Rank Fusion): Combines rankings from multiple sources into a single, more relevant list. To use RRF with the Elasticsearch vector database, you need Elasticsearch version 8.8 or later.
- Weighted: Assigns importance to outputs and prioritizes the most reliable one for the final output.
- None: Uses only dense embeddings, without a hybrid strategy.
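The following sketch illustrates the expand and merge steps of the window retrieval method on a list of chunks that store their sequence numbers in metadata. It is a simplified illustration of the mechanism described above, not the AutoAI implementation:
# Illustrative only: a simplified version of the window retrieval expansion step.
def expand_with_window(all_chunks, retrieved_index, window_size=2):
    """Merge the retrieved chunk with window_size neighboring chunks on each side.

    all_chunks is a list of dicts such as
    {"text": "...", "metadata": {"sequence_number": 7, "document_id": "doc1"}},
    ordered by position in the source document.
    """
    # Expand: use the position of the retrieved chunk to select its neighbors.
    start = max(0, retrieved_index - window_size)
    end = min(len(all_chunks), retrieved_index + window_size + 1)
    window = all_chunks[start:end]

    # Merge: combine the window text (a real implementation also removes
    # text that overlaps between adjacent chunks).
    merged_text = " ".join(chunk["text"] for chunk in window)

    # Metadata handling: keep the same keys and group the values into lists.
    merged_metadata = {}
    for chunk in window:
        for key, value in chunk["metadata"].items():
            merged_metadata.setdefault(key, []).append(value)

    # Return: the merged window becomes a new chunk that replaces the original one.
    return {"text": merged_text, "metadata": merged_metadata}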
Foundation models to include
You can choose between using provided foundation models or custom foundation models.
By default, all available provided foundation models that support AutoAI for RAG are selected for experimentation. You can manually edit the list of provided foundation models that AutoAI can consider for generating RAG patterns. For each model, you can click Model details to view or export details about the model.
For the list of available provided foundation models along with descriptions, see Foundation models by task.
To use custom foundation models, click Custom models and select the models you want AutoAI to consider for generating RAG patterns. The list of custom models includes deploy on demand models, custom models that are deployed in the project where you are running the experiment, and custom models in spaces where you are a member.
To add a new custom foundation model, see Deploying custom foundation models.
For information on how to code an experiment with a custom foundation model, see Coding an AutoAI RAG experiment with a custom foundation model.
Max RAG patterns to complete
You can specify the number of RAG patterns to complete in the experimentation phase, up to a maximum of 20. Comparing more patterns might result in higher-scoring patterns, but consumes more compute resources.
Match input language
By default, AutoAI automatically detects the language used in prompts and instructs models to respond in the same language. Models that do not support the input language are given lower priority in the search for the RAG pattern. Turn off this setting to consider all available models and generate responses in English only.
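If you are configuring the experiment in code rather than in the UI, this option corresponds to the generation parameter shown in the configuration table below. For example, the following dictionary turns off automatic language detection so that responses are generated in English only:
# Disable automatic language detection; responses are generated in English only
generation_config = {"language": {"auto_detect": False}}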
Indexing settings
View or edit the settings for creating the text vector database from the document collection.
Chunking
You can accept the automatically selected configuration for chunking your data or edit the configuration settings. Chunking settings determine how indexed documents are broken down into smaller pieces before ingestion into a vector store. Chunking makes it possible to search for and retrieve the pieces of a document that are most relevant to a query, so that the generation model processes only the most relevant data.
AutoAI RAG uses LangChain’s recursive text splitter to break down the documents into chunks. This method decomposes the document hierarchically, trying to keep paragraphs (and then sentences, and then words) together as long as possible until each chunk is smaller than the requested chunk size. For more information about the recursive chunking method, see Retrieval recursively split by character in the LangChain documentation.
How to best chunk your data depends on your use case. Smaller chunks provide a more granular interaction with text, enabling more focused search for relevant content, whereas larger chunks can provide more context. For your chunking use case, specify one or more options for:
- The number of characters to include in each chunk of data.
- The number of characters of overlap between consecutive chunks. The overlap must be smaller than the chunk size.
The selected options are explored and compared in the experimentation phase.
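AutoAI handles chunking for you during the experiment, but if you want to get a feel for how a chunk size and overlap combination splits your documents, you can try LangChain's recursive splitter directly. A minimal sketch, assuming the langchain-text-splitters package is installed and my_document.txt stands in for one of your collection documents:
# Illustrative only: preview how a chunk size and overlap combination splits a
# document, using the same recursive splitting method that AutoAI RAG uses.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,     # maximum number of characters per chunk
    chunk_overlap=128,  # characters shared between consecutive chunks
)
with open("my_document.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())
print(f"{len(chunks)} chunks; first chunk: {chunks[0][:80]}")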
Embedding models
Embedding models are used in retrieval-augmented generation solutions for encoding chunks and queries as vectors to capture their semantic meaning. The vectorized input data chunks are ingested into a vector store. Given a query, the vectorized representation is used to search the vector store for relevant chunks.
For a list of embedding models available for use with AutoAI RAG experiments, see Supported encoder models available with watsonx.ai.
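AutoAI performs the embedding step for you during the experiment, but the following sketch shows what encoding a query and a few chunks as vectors looks like with the ibm-watsonx-ai SDK. This is an assumption-laden example for orientation only; the API key, URL, and project ID values are placeholders:
# Illustrative only: encoding text as vectors with a watsonx.ai embedding model.
# During an AutoAI RAG experiment, this step is performed for you.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings

embeddings = Embeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    credentials=Credentials(api_key="<PASTE_API_KEY_HERE>", url="https://us-south.ml.cloud.ibm.com"),
    project_id="<PASTE_PROJECT_ID_HERE>",
)
query_vector = embeddings.embed_query("How do I customize RAG experiment settings?")
chunk_vectors = embeddings.embed_documents(["First chunk of text.", "Second chunk of text."])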
Additional information
Review the watsonx.ai Runtime instance used for this experiment and the environment definition.
Configuration parameters for experiment settings
If you are coding an AutoAI RAG experiment, you can configure parameters programmatically by using the rag_optimizer object. For more information about how to initialize the RAG optimizer with customized parameters, see Working with AutoAI RAG class and rag_optimizer.
Parameter | Description | Values |
---|---|---|
name | Enter a valid name for the experiment | Experiment name |
description | Optionally describe the experiment | Experiment description |
chunking | Chunking settings for document splitting | {"method": "recursive", "chunk_size": 256, "chunk_overlap": 128} |
embedding_models | Embedding models to try | ibm/slate-125m-english-rtrvr, intfloat/multilingual-e5-large |
retrieval | Retrieval settings | Use the AutoAIRAGRetrievalConfig dataclass |
foundation_models | Foundation models or custom models to use | See Foundation models by task. Use the AutoAIRAGModelConfig or AutoAIRAGCustomModelConfig dataclass |
generation | Generation step configuration | {"language": {"auto_detect": False}} |
max_number_of_rag_patterns | Maximum number of RAG patterns to create | 4–20 |
optimization_metrics | Metric name(s) to use for optimization | faithfulness, answer_correctness, context_correctness |
Below are examples of code you can use to initialize the RAG optimizer and configure each parameter.
Example of a retrieval configuration:
from ibm_watsonx_ai.foundation_models.schema import AutoAIRAGRetrievalConfig, AutoAIRAGHybridRankerParams, HybridRankerStrategy
from ibm_watsonx_ai.foundation_models.extensions.rag.retriever import RetrievalMethod

retrieval_config = AutoAIRAGRetrievalConfig(
    method=RetrievalMethod.SIMPLE,  # or RetrievalMethod.WINDOW
    number_of_chunks=5,             # number of chunks to retrieve for each query
    window_size=2,                  # applies to the window retrieval method
    hybrid_ranker=AutoAIRAGHybridRankerParams(
        strategy=HybridRankerStrategy.RRF,  # strategy for combining sparse and dense results
        sparse_vectors={"model_id": "elser_model_2"},  # ELSER model for sparse vectors (Elasticsearch)
        alpha=0.9,
        k=70,
    )
)
Example of a foundation model configuration:
from ibm_watsonx_ai.foundation_models.schema import (
    AutoAIRAGModelConfig,
    AutoAIRAGCustomModelConfig,
    AutoAIRAGModelParams,
    TextGenDecodingMethod,
)

# Foundation model
model_id = "meta-llama/llama-3-1-8b-instruct"

# Foundation model with properties
fm = AutoAIRAGModelConfig(
    model_id="ibm/granite-13b-instruct-v2",
    parameters=AutoAIRAGModelParams(
        decoding_method=TextGenDecodingMethod.SAMPLE,
        min_new_tokens=5,
        max_new_tokens=300,
        max_sequence_length=4096,
    ),
    prompt_template_text="My question {question} related to these documents {reference_documents}.",
    context_template_text="My document {document}",
    word_to_token_ratio=1.5,
)

# Custom foundation model with properties
custom_fm = AutoAIRAGCustomModelConfig(
    deployment_id="<PASTE_DEPLOYMENT_ID_HERE>",
    space_id="<PASTE_SPACE_ID_HERE>",
    parameters=AutoAIRAGModelParams(
        decoding_method=TextGenDecodingMethod.GREEDY,
        min_new_tokens=5,
        max_new_tokens=300,
        max_sequence_length=4096,
    ),
    prompt_template_text="My question {question} related to these documents {reference_documents}.",
    context_template_text="My document {document}",
    word_to_token_ratio=1.5,
)

foundation_models = [model_id, fm, custom_fm]
Example of a chunking configuration:
chunking_config = {
    "method": "recursive",
    "chunk_size": 256,
    "chunk_overlap": 128,
}
Example of an initialization of a RAG optimizer with a customized configuration:
from ibm_watsonx_ai.experiment import AutoAI

experiment = AutoAI(credentials, project_id=project_id)

rag_optimizer = experiment.rag_optimizer(
    name="DEMO - AutoAI RAG ibm-watsonx-ai SDK documentation",
    description="AutoAI RAG experiment grounded with the ibm-watsonx-ai SDK documentation",
    embedding_models=["ibm/slate-125m-english-rtrvr", "intfloat/multilingual-e5-large"],
    foundation_models=foundation_models,
    retrieval=[retrieval_config],
    chunking=[chunking_config],
    generation={"language": {"auto_detect": False}},
    max_number_of_rag_patterns=5,
    optimization_metrics=[AutoAI.RAGMetrics.ANSWER_CORRECTNESS],
)
Learn more
Retrieval-Augmented Generation (RAG)
Parent topic: Creating a RAG experiment