Evaluate metrics

Last updated: Jul 07, 2025

The evaluate metrics module helps you calculate quality metrics for large language model (LLM) output.

Evaluate metrics is a module in the ibm-watsonx-gov Python SDK that contains methods to compute scores for the context relevance, faithfulness, and answer similarity metrics. You can use model insights to visualize the evaluation results.

To use the evaluate metrics module, install the ibm-watsonx-gov Python SDK with the metrics optional dependencies:

pip install "ibm-watsonx-gov[metrics]"

Examples

You can use the evaluate metrics module to calculate metrics as shown in the following examples:

Simplified metrics evaluation

import os

from ibm_watsonx_gov.evaluators.metrics_evaluator import MetricsEvaluator
from ibm_watsonx_gov.metrics import AnswerSimilarityMetric

# Authenticate with your IBM watsonx API key
os.environ["WATSONX_APIKEY"] = "..."

evaluator = MetricsEvaluator()
metrics = [AnswerSimilarityMetric()]
result = evaluator.evaluate(data=input_df, metrics=metrics)
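In both examples, input_df is a pandas DataFrame that holds the records to score. A minimal sketch of such a frame, assuming the column names follow the field configuration shown in the advanced example below (the sample record itself is hypothetical):

import pandas as pd

# Hypothetical sample record; column names match the field
# configuration used in the advanced example below
input_df = pd.DataFrame([{
    "question": "What is the capital of France?",
    "context": "Paris is the capital and largest city of France.",
    "generated_text": "The capital of France is Paris.",
    "reference_answer": "Paris",
}])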

Advanced metrics evaluation

from ibm_watsonx_gov.evaluators.metrics_evaluator import MetricsEvaluator
from ibm_watsonx_gov.metrics import AnswerSimilarityMetric
from ibm_watsonx_gov.config import GenAIConfiguration
from ibm_watsonx_gov.clients.api_client import APIClient
from ibm_watsonx_gov.credentials import Credentials

# Map the evaluator to the columns in your input data
config = GenAIConfiguration(input_fields=["question"],
                            context_fields=["context"],
                            output_fields=["generated_text"],
                            reference_fields=["reference_answer"])

# Authenticate with an explicit client instead of an environment variable
wxgov_client = APIClient(credentials=Credentials(api_key=""))
evaluator = MetricsEvaluator(configuration=config, api_client=wxgov_client)
metrics = [AnswerSimilarityMetric()]

result = evaluator.evaluate(data=input_df, metrics=metrics)
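The evaluate call returns a result object that aggregates the computed scores. As a sketch of how you might inspect it, assuming the result object exposes a to_df() helper for tabular output (check the SDK reference for the exact interface):

# Assumption: the result can be converted to a DataFrame of
# per-record metric scores; verify against the SDK reference.
scores_df = result.to_df()
print(scores_df.head())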

For more information, see the Evaluate metrics notebook.

Parent topic: Metrics computation using Python SDK