Customizing RAG experiment settings
When you build a retrieval-augmented generation solution in AutoAI, you can customize experiment settings to tailor your results.
If you run a RAG experiment based on default settings, the AutoAI process selects:
- The optimization metric to be maximized when searching for the best RAG pipeline
- The embedding models to try, based on the available list
- The foundation models to try, based on the available list
To exercise more control over the RAG experiment, you can customize the experiment settings. After entering the required experiment definition information, click Experiment settings to customize options before running the experiment. Settings you can review or edit fall into three categories:
- Retrieval & generation: choose which metric to use to optimize the choice of RAG pattern, how much data to retrieve, and the models AutoAI can use for the experiment.
- Indexing: choose how the data is broken down into chunks, the metric used to measure semantic similarity, and which embedding model AutoAI can use for experimentation.
- Additional information: review the watsonx.ai Runtime instance and the environment to use for the experiment.
Retrieval and generation settings
View or edit the settings that are used to generate the RAG pipelines.
Optimization metric
Choose the metric to maximize when searching for the optimal RAG patterns. For more information about optimization metrics and their implementation details, see RAG metrics.
- Answer faithfulness measures how closely the generated response aligns with the context retrieved from the vector store. The score is calculated using a lexical metric that counts how many of the generated response tokens are included in the context retrieved from the vector store. A high score indicates that the response represents the retrieved context well. Note that a high faithfulness score does not necessarily indicate correctness of the response. For more information on how the metric is implemented, see Faithfulness.
- Answer correctness measures the correctness of the generated answer compared to the correct answer provided in the benchmark files. This includes the relevance of the retrieved context and the quality of the generated response. The score is calculated using a lexical metric that counts how many of the ground-truth response tokens are included in the generated response. For more information on how the metric is implemented, see Correctness.
- Context correctness indicates to what extent the context retrieved from the vector store aligns with the ground truth context provided in the benchmark. The score is calculated based on the rank of the ground truth context among the retrieved chunks. The closer the ground truth context is to the top of the list, the higher the score. For more information on how the metric is implemented, see Context correctness.
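To illustrate the token-overlap idea behind these lexical metrics, the following sketch computes a simple overlap score in the spirit of answer faithfulness. It is an illustration only, not the actual AutoAI implementation, which tokenizes and normalizes text differently (see RAG metrics):
# Illustrative only: a simplified token-overlap score in the spirit of the
# lexical faithfulness metric. The AutoAI implementation differs in how it
# tokenizes and normalizes text.
def lexical_overlap_score(response: str, context: str) -> float:
    """Return the fraction of response tokens that also appear in the context."""
    response_tokens = response.lower().split()
    context_tokens = set(context.lower().split())
    if not response_tokens:
        return 0.0
    matched = sum(1 for token in response_tokens if token in context_tokens)
    return matched / len(response_tokens)

# 5 of the 6 response tokens appear in the retrieved context, so the score is ~0.83
print(lexical_overlap_score(
    "AutoAI selects the best RAG pattern",
    "AutoAI searches for the best RAG pattern using optimization metrics",
))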
Retrieval methods
You can accept the automatically selected configuration for retrieving relevant data or edit the configuration settings. Retrieval methods differ in the ways that they filter and rank documents.
- Choose the simple retrieval method or the window retrieval method. (A sketch that illustrates the window retrieval mechanism follows this list.)
  - Simple retrieval method finds the most relevant chunks in the vector store.
  - Window retrieval method surrounds each retrieved chunk with additional chunks from before and after it in the original document. This method is useful for including context that might be missing from the originally retrieved chunk. Window retrieval works as follows:
    - Search: Finds the most relevant document chunks in the vector store.
    - Expand: For each found chunk, retrieves surrounding chunks to provide context. Each chunk stores its sequence number in its metadata, and after a chunk is retrieved, that metadata is used to fetch neighboring chunks from the same document. For example, if window_size is 2, the 2 chunks before and the 2 chunks after the retrieved chunk are added.
    - Merge: Combines overlapping text within the window to remove redundancy.
    - Metadata handling: Merges metadata dictionaries by keeping the same keys and grouping values into lists.
    - Return: Outputs the merged window as a new chunk, replacing the original one.
- Select the number of chunks, from 1 to 10. The number of chunks determines how many of the most relevant chunks are retrieved from the vector store for each query.
- If you select the window retrieval method, you can set the window size from 1 to 4. The window size is the number of adjacent chunks that are retrieved before and after each matched chunk to provide additional context.
- You can optionally select a hybrid strategy to improve the output quality. A hybrid strategy combines sparse and dense embedding vectors to perform a similarity search in the vector database. Sparse embeddings prioritize exact keyword matches, and dense embeddings prioritize outputs that have semantic similarity. Combining sparse and dense embeddings improves search accuracy and relevance, resulting in more comprehensive information retrieval from the database. This setting is not available for the in-memory Chroma vector database. If you're using the Elasticsearch vector database, you need to install the ELSER model.
Choose from one of these hybrid strategy options:
- RRF (Reciprocal Rank Fusion): Combines rankings from multiple sources into a single, more relevant list. To use RRF with the Elasticsearch vector database, you need Elasticsearch version 8.8 or later.
- Weighted: Assigns importance to outputs and prioritizes the most reliable one for the final output.
- None: Uses only dense embeddings, without a hybrid strategy.
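The following sketch illustrates the expand and merge steps of the window retrieval method on a list of chunks that store their sequence numbers in metadata. It is a simplified illustration of the mechanism described above, not the AutoAI implementation:
# Illustrative only: a simplified version of the window retrieval expansion step.
def expand_with_window(all_chunks, retrieved_index, window_size=2):
    """Merge the retrieved chunk with window_size neighboring chunks on each side.

    all_chunks is a list of dicts such as
    {"text": "...", "metadata": {"sequence_number": 7, "document_id": "doc1"}},
    ordered by position in the source document.
    """
    # Expand: use the position of the retrieved chunk to select its neighbors.
    start = max(0, retrieved_index - window_size)
    end = min(len(all_chunks), retrieved_index + window_size + 1)
    window = all_chunks[start:end]

    # Merge: combine the window text (a real implementation also removes
    # text that overlaps between adjacent chunks).
    merged_text = " ".join(chunk["text"] for chunk in window)

    # Metadata handling: keep the same keys and group the values into lists.
    merged_metadata = {}
    for chunk in window:
        for key, value in chunk["metadata"].items():
            merged_metadata.setdefault(key, []).append(value)

    # Return: the merged window becomes a new chunk that replaces the original one.
    return {"text": merged_text, "metadata": merged_metadata}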
Foundation models to include
You can choose between using provided foundation models or custom foundation models.
By default, all available provided foundation models that support AutoAI for RAG are selected for experimentation. You can manually edit the list of provided foundation models that AutoAI can consider for generating RAG patterns. For each model, you can click Model details to view or export details about the model.
For the list of available provided foundation models along with descriptions, see Foundation models by task.
To use custom foundation models, click Custom models and select the models you want AutoAI to consider for generating RAG patterns. The list of custom models includes deploy on demand models, custom models that are deployed in the project where you are running the experiment, and custom models in spaces where you are a member.
To add a new custom foundation model, see Deploying custom foundation models.
For information on how to code an experiment with a custom foundation model, see Coding an AutoAI RAG experiment with a custom foundation model.
Max RAG patterns to complete
You can specify the number of RAG patterns to complete in the experimentation phase, up to a maximum of 20. Comparing more patterns might result in higher-scoring patterns, but consumes more compute resources.
Match input language
By default, AutoAI automatically detects the language used in prompts and instructs models to respond in the same language. Models that do not support the input language are given lower priority in the search for the RAG pattern. Turn off this setting to consider all available models and generate responses in English only.
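If you are configuring the experiment in code rather than in the UI, this option corresponds to the generation parameter shown in the configuration table below. For example, the following dictionary turns off automatic language detection so that responses are generated in English only:
# Disable automatic language detection; responses are generated in English only
generation_config = {"language": {"auto_detect": False}}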
Indexing settings
View or edit the settings for creating the text vector database from the document collection.
Chunking
You can accept the automatically selected configuration for chunking your data or edit the configuration settings. Chunking settings determine how indexed documents are broken down into smaller pieces before ingestion into a vector store. Chunking makes it possible to search for and retrieve the pieces of a document that are most relevant to a query, so that the generation model processes only the most relevant data.
AutoAI RAG uses LangChain’s recursive text splitter to break down the documents into chunks. This method decomposes the document hierarchically, trying to keep paragraphs (and then sentences, and then words) together as long as possible until each chunk is smaller than the requested chunk size. For more information about the recursive chunking method, see Retrieval recursively split by character in the LangChain documentation.
How to best chunk your data depends on your use case. Smaller chunks provide a more granular interaction with text, enabling more focused search for relevant content, whereas larger chunks can provide more context. For your chunking use case, specify one or more options for:
- The number of characters to include in each chunk of data.
- The number of characters of overlap between consecutive chunks. The overlap must be smaller than the chunk size.
The selected options are explored and compared in the experimentation phase.
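AutoAI handles chunking for you during the experiment, but if you want to get a feel for how a chunk size and overlap combination splits your documents, you can try LangChain's recursive splitter directly. A minimal sketch, assuming the langchain-text-splitters package is installed and my_document.txt stands in for one of your collection documents:
# Illustrative only: preview how a chunk size and overlap combination splits a
# document, using the same recursive splitting method that AutoAI RAG uses.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,     # maximum number of characters per chunk
    chunk_overlap=128,  # characters shared between consecutive chunks
)
with open("my_document.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())
print(f"{len(chunks)} chunks; first chunk: {chunks[0][:80]}")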
Embedding models
Embedding models are used in retrieval-augmented generation solutions for encoding chunks and queries as vectors to capture their semantic meaning. The vectorized input data chunks are ingested into a vector store. Given a query, the vectorized representation is used to search the vector store for relevant chunks.
For a list of embedding models available for use with AutoAI RAG experiments, see Supported encoder models available with watsonx.ai.
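AutoAI performs the embedding step for you during the experiment, but the following sketch shows what encoding a query and a few chunks as vectors looks like with the ibm-watsonx-ai SDK. This is an assumption-laden example for orientation only; the API key, URL, and project ID values are placeholders:
# Illustrative only: encoding text as vectors with a watsonx.ai embedding model.
# During an AutoAI RAG experiment, this step is performed for you.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings

embeddings = Embeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    credentials=Credentials(api_key="<PASTE_API_KEY_HERE>", url="https://us-south.ml.cloud.ibm.com"),
    project_id="<PASTE_PROJECT_ID_HERE>",
)
query_vector = embeddings.embed_query("How do I customize RAG experiment settings?")
chunk_vectors = embeddings.embed_documents(["First chunk of text.", "Second chunk of text."])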
Additional information
Review the watsonx.ai Runtime instance used for this experiment and the environment definition.
Configuration parameters for experiment settings
If you are coding an AutoAI RAG experiment, you can configure parameters programmatically by using the rag_optimizer object. For more information about how to initialize the RAG optimizer with customized parameters, see Working with AutoAI RAG class and rag_optimizer.
Parameter | Description | Values |
---|---|---|
name | Enter a valid name for the experiment | Experiment name |
description | Optionally describe the experiment | Experiment description |
chunking | Chunking settings for document splitting | {"method": "recursive", "chunk_size": 256, "chunk_overlap": 128} |
embedding_models | Embedding models to try | ibm/slate-125m-english-rtrvr, intfloat/multilingual-e5-large |
retrieval | Retrieval settings | Use the AutoAIRAGRetrievalConfig dataclass |
foundation_models | Foundation models or custom models to use | See Foundation models by task. Use the AutoAIRAGModelConfig or AutoAIRAGCustomModelConfig dataclass |
generation | Generation step configuration | {"language": {"auto_detect": False}} |
max_number_of_rag_patterns | Maximum number of RAG patterns to create | 4–20 |
optimization_metrics | Metric name(s) to use for optimization | faithfulness, answer_correctness, context_correctness |
Below are examples of code you can use to initialize the RAG optimizer and configure each parameter.
Example of a retrieval configuration:
from ibm_watsonx_ai.foundation_models.schema import AutoAIRAGRetrievalConfig, AutoAIRAGHybridRankerParams, HybridRankerStrategy
from ibm_watsonx_ai.foundation_models.extensions.rag.retriever import RetrievalMethod

retrieval_config = AutoAIRAGRetrievalConfig(
    method=RetrievalMethod.SIMPLE,  # or RetrievalMethod.WINDOW
    number_of_chunks=5,             # number of chunks to retrieve for each query
    window_size=2,                  # applies to the window retrieval method
    hybrid_ranker=AutoAIRAGHybridRankerParams(
        strategy=HybridRankerStrategy.RRF,  # strategy for combining sparse and dense results
        sparse_vectors={"model_id": "elser_model_2"},  # ELSER model for sparse vectors (Elasticsearch)
        alpha=0.9,
        k=70,
    )
)
Example of a foundation model configuration:
from ibm_watsonx_ai.foundation_models.schema import (
    AutoAIRAGModelConfig,
    AutoAIRAGCustomModelConfig,
    AutoAIRAGModelParams,
    TextGenDecodingMethod,
)

# Foundation model
model_id = "meta-llama/llama-3-1-8b-instruct"

# Foundation model with properties
fm = AutoAIRAGModelConfig(
    model_id="ibm/granite-13b-instruct-v2",
    parameters=AutoAIRAGModelParams(
        decoding_method=TextGenDecodingMethod.SAMPLE,
        min_new_tokens=5,
        max_new_tokens=300,
        max_sequence_length=4096,
    ),
    prompt_template_text="My question {question} related to these documents {reference_documents}.",
    context_template_text="My document {document}",
    word_to_token_ratio=1.5,
)

# Custom foundation model with properties
custom_fm = AutoAIRAGCustomModelConfig(
    deployment_id="<PASTE_DEPLOYMENT_ID_HERE>",
    space_id="<PASTE_SPACE_ID_HERE>",
    parameters=AutoAIRAGModelParams(
        decoding_method=TextGenDecodingMethod.GREEDY,
        min_new_tokens=5,
        max_new_tokens=300,
        max_sequence_length=4096,
    ),
    prompt_template_text="My question {question} related to these documents {reference_documents}.",
    context_template_text="My document {document}",
    word_to_token_ratio=1.5,
)

foundation_models = [model_id, fm, custom_fm]
Example of a chunking configuration:
chunking_config = {
    "method": "recursive",
    "chunk_size": 256,
    "chunk_overlap": 128,
}
Example of an initialization of a RAG optimizer with a customized configuration:
from ibm_watsonx_ai.experiment import AutoAI

experiment = AutoAI(credentials, project_id=project_id)

rag_optimizer = experiment.rag_optimizer(
    name="DEMO - AutoAI RAG ibm-watsonx-ai SDK documentation",
    description="AutoAI RAG experiment grounded with the ibm-watsonx-ai SDK documentation",
    embedding_models=["ibm/slate-125m-english-rtrvr", "intfloat/multilingual-e5-large"],
    foundation_models=foundation_models,
    retrieval=[retrieval_config],
    chunking=[chunking_config],
    generation={"language": {"auto_detect": False}},
    max_number_of_rag_patterns=5,
    optimization_metrics=[AutoAI.RAGMetrics.ANSWER_CORRECTNESS],
)
Learn more
Retrieval-Augmented Generation (RAG)
Parent topic: Creating a RAG experiment