Evaluation metrics help you continuously monitor the performance of your AI models and provide insights throughout the AI lifecycle. With watsonx.governance, you can use these metrics to help ensure compliance with regulatory requirements and to identify improvements that mitigate risks.
You can run evaluations in watsonx.governance to generate metrics with automated monitoring that provides actionable insights for your AI governance goals. You can use these metrics to help achieve the following goals:
Ensure compliance: Automatically track adherence to evolving regulations and organizational policies with alerts triggered when thresholds are breached.
Promote transparency: Generate detailed documentation to provide clear insights into model behavior, performance, and explainability of outcomes.
Mitigate risks: Detect and address issues like bias or accuracy drift through continuous evaluation and proactive risk assessments.
Protect privacy and security: Monitor for security vulnerabilities like exposure of personally identifiable information (PII) and enforce guardrails to prevent misuse of sensitive data.
The metrics that you can use to provide insights about your model performance are determined by the type of evaluations that you enable. Each type of evaluation generates different metrics that you can analyze to gain insights.
You can also use the ibm-watsonx-gov Python SDK to calculate metrics in a notebook runtime environment or offload them as Spark jobs against IBM Analytics Engine for evaluations.
The Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models. Some metrics might be available only with the Python SDK. For more information, see Metrics computation with the Python SDK.
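The SDK's own classes and methods are not reproduced here. As a rough sketch of the kind of per-record metric computation that an evaluation runs in a notebook, the following plain-Python example scores a small hypothetical batch with pandas; the column names and the metric are illustrative assumptions, not ibm-watsonx-gov API calls.

```python
# Illustrative only: plain pandas, not the ibm-watsonx-gov SDK API.
import pandas as pd

# Hypothetical evaluation records: model predictions alongside ground truth.
records = pd.DataFrame({
    "prediction": ["approved", "denied", "approved", "approved"],
    "label":      ["approved", "denied", "denied",   "approved"],
})

# A simple per-batch metric: the fraction of predictions that match the label.
accuracy = (records["prediction"] == records["label"]).mean()
print(f"accuracy = {accuracy:.2f}")  # 0.75 for this toy batch
```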
Drift evaluation metrics
Drift evaluation metrics can help you detect drops in accuracy and data consistency in your models to determine how well your model predicts outcomes over time. Watsonx.governance supports the following drift evaluation metrics for machine learning models:
Compares runtime transactions with the patterns of transactions in the training data to identify inconsistencies
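Purely as a toy illustration of the idea behind this comparison, and assuming simple numeric features and a range-based rule (neither of which is how the product's drift detection actually works), the following sketch flags runtime transactions whose feature values fall outside the ranges seen in the training data.

```python
import pandas as pd

# Hypothetical training and runtime feature data.
train = pd.DataFrame({"age": [25, 40, 58, 33], "income": [30_000, 72_000, 95_000, 54_000]})
runtime = pd.DataFrame({"age": [29, 71, 45], "income": [48_000, 20_000, 310_000]})

# Learn simple per-feature ranges from the training data.
lo, hi = train.min(), train.max()

# Flag runtime transactions with any feature value outside the training range.
inconsistent = ((runtime < lo) | (runtime > hi)).any(axis=1)
print(f"inconsistent transactions: {inconsistent.sum()} of {len(runtime)}")
```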
Drift v2 evaluation metrics
Drift v2 evaluation metrics can help you measure changes in your data over time to ensure consistent outcomes for your model. You can use these metrics to identify changes in your model output, the accuracy of your predictions, and the distribution
of your input data. Watsonx.governance supports the following drift v2 metrics:
Measures the change in distribution of the LLM predicted classes.
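The documentation does not specify which statistic quantifies the distribution change, so the following sketch uses total variation distance between a baseline and a current predicted-class distribution purely as an illustration.

```python
from collections import Counter

def class_distribution(predictions):
    """Normalize predicted-class counts into a probability distribution."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

baseline = class_distribution(["positive", "negative", "positive", "neutral"])
current  = class_distribution(["negative", "negative", "neutral", "negative"])

# Total variation distance: half the sum of absolute probability differences.
classes = set(baseline) | set(current)
tvd = 0.5 * sum(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in classes)
print(f"change in predicted-class distribution (TVD) = {tvd:.2f}")
```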
Fairness evaluation metrics
Fairness evaluation metrics can help you determine whether your model produces biased outcomes. You can use these metrics to identify when your model shows a tendency to provide favorable outcomes more often for one group over another. Watsonx.governance supports the following fairness evaluation metrics:
Compares the rate that monitored groups are selected to receive favorable outcomes to the rate that reference groups are selected to receive favorable outcomes.
Compares the percentage of favorable outcomes for monitored groups to reference groups.
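The two descriptions above amount to a ratio-style and a difference-style comparison of favorable-outcome rates. The following sketch computes both from a small hypothetical dataset; the group labels and the favorable outcome value are illustrative assumptions.

```python
import pandas as pd

# Hypothetical outcomes for a monitored group and a reference group.
data = pd.DataFrame({
    "group":   ["monitored", "monitored", "monitored", "reference", "reference", "reference"],
    "outcome": ["approved",  "denied",    "denied",    "approved",  "approved",  "denied"],
})

favorable = data["outcome"] == "approved"
monitored_rate = favorable[data["group"] == "monitored"].mean()
reference_rate = favorable[data["group"] == "reference"].mean()

# Ratio-style comparison of the favorable-outcome rates.
print(f"rate ratio      = {monitored_rate / reference_rate:.2f}")
# Difference-style comparison of the favorable-outcome percentages.
print(f"rate difference = {monitored_rate - reference_rate:.2f}")
```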
Generative AI quality evaluation metrics
Generative AI quality evaluation metrics can help you measure how well your foundation model performs tasks. Watsonx.governance supports the following generative AI quality evaluation metrics:
Table 4. Generative AI quality evaluation metric descriptions
Compares translated sentences from machine translations to sentences from reference translations to measure the similarity between reference texts and predictions
Evaluates the output of a model against SuperGLUE datasets by measuring the F1 score, precision, and recall against the model predictions and its ground truth data
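As a rough illustration of measuring F1 score, precision, and recall against ground truth data, the following sketch computes token-level overlap between a single prediction and its reference; the tokenization and aggregation that the product uses are not specified here.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str):
    """Token-overlap precision, recall, and F1 between a prediction and its reference."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))
```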
Watsonx.governance also supports the following categories of generative AI quality metrics:
Answer quality metrics
You can use answer quality metrics to evaluate the quality of model answers. Answer quality metrics are calculated with LLM-as-a-judge models; to calculate them, you can create a scoring function that calls those models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following answer quality metrics:
Measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output.
Measures how much shorter the summary is when compared to the input text by calculating the ratio between the number of words in the original text and the number of words in the foundation model output
Measures the extent that the foundation model output is generated from the model input by calculating the percentage of output text that is also in the input
Measures how extractive the summary in the foundation model output is from the model input by calculating the average of extractive fragments that closely resemble verbatim extractions from the original text
Measures the percentage of n-grams that repeat in the foundation model output by calculating the number of repeated n-grams and the total number of n-grams in the model output
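Two of the descriptions above reduce to simple word and n-gram counts. The following sketch computes a compression-style ratio and a repeated n-gram rate for one input/output pair, using whitespace tokenization as a simplifying assumption.

```python
from collections import Counter

def compression_ratio(source: str, summary: str) -> float:
    """Ratio of the word count of the original text to the word count of the model output."""
    return len(source.split()) / len(summary.split())

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of n-grams in the model output that occur more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)

source = "The quarterly report covers revenue, costs, hiring, and the outlook for next year in detail."
summary = "The report covers revenue, costs, and the outlook for next year."
print(f"compression ratio    = {compression_ratio(source, summary):.2f}")
print(f"repeated 3-gram rate = {repeated_ngram_ratio(summary):.2f}")
```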
Data safety metrics
You can use the following data safety metrics to identify whether your model's input or output contains harmful or sensitive information:
Table 7. Data safety evaluation metric descriptions
Measures whether your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing entity extraction model
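The evaluation itself relies on the Watson Natural Language Processing entity extraction model. Purely as a toy stand-in for the idea of scanning model input and output text for personally identifiable information, the following sketch checks for email and phone-number patterns with regular expressions.

```python
import re

# Toy patterns only; the actual evaluation relies on an NLP entity extraction model.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def contains_pii(text: str) -> dict:
    """Return which toy PII patterns match the given model input or output text."""
    return {name: bool(pattern.search(text)) for name, pattern in PII_PATTERNS.items()}

print(contains_pii("Contact me at jane.doe@example.com or 555-123-4567."))
```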
Multi-label/class metrics
You can use the following multi-label/class metrics to measure model performance for multi-label/multi-class predictions:
The ratio of the number of correct predictions over all classes to the number of true samples.
Retrieval quality metrics
You can use retrieval quality metrics to measure how well the retrieval system ranks relevant contexts. Retrieval quality metrics are calculated with LLM-as-a-judge models; to calculate them, you can create a scoring function that calls those models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following retrieval quality metrics:
Measures the quantity of relevant contexts out of the total number of contexts that are retrieved
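The retrieval quality description above is a simple proportion. The following sketch computes it from per-context relevance judgments; in practice those judgments come from an LLM-as-a-judge model rather than the hand-labeled list used here.

```python
# Hypothetical relevance judgments for the retrieved contexts (True = relevant).
retrieved_context_relevance = [True, False, True, True, False]

# Share of retrieved contexts that are relevant.
precision = sum(retrieved_context_relevance) / len(retrieved_context_relevance)
print(f"retrieval precision = {precision:.2f}")  # 0.60 for this toy example
```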
Model health monitor evaluation metrics
Model health monitor evaluation metrics can help you understand your model behavior and performance by determining how efficiently your model deployment processes your transactions. Model health evaluation metrics are enabled by default for
machine learning model evaluations in production and generative AI asset deployments. Watsonx.governance supports the following model health monitor evaluation metrics:
Table 10. Model health monitor evaluation metric descriptions
The total, average, minimum, maximum, and median payload size of the transaction records that your model deployment processes across scoring requests in kilobytes (KB)
Calculates the total, average, minimum, maximum, and median output token count across scoring requests during evaluations
Throughput and latency
Model health monitor evaluations calculate latency by tracking the time that it takes to process scoring requests and transaction records, measured in milliseconds (ms). Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.
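As an illustration of these calculations, the following sketch derives per-request latency statistics in milliseconds and throughput in requests per second from hypothetical request start and end timestamps.

```python
from statistics import mean, median

# Hypothetical (start, end) timestamps for scoring requests, in seconds.
requests = [(0.00, 0.12), (0.05, 0.31), (0.40, 0.52), (0.70, 1.05)]

# Latency per scoring request, in milliseconds.
latencies_ms = [(end - start) * 1000 for start, end in requests]
print(f"latency ms: avg={mean(latencies_ms):.0f} min={min(latencies_ms):.0f} "
      f"max={max(latencies_ms):.0f} median={median(latencies_ms):.0f}")

# Throughput: scoring requests processed per second of elapsed wall-clock time.
elapsed_s = max(end for _, end in requests) - min(start for start, _ in requests)
print(f"throughput = {len(requests) / elapsed_s:.2f} requests/second")
```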
The following metrics are calculated to measure throughput and latency during evaluations:
Table 12. Model health monitor throughput and latency metric descriptions