Evaluation metrics help you continuously monitor the performance of your AI models and provide insights throughout the AI lifecycle. With watsonx.governance, you can use these metrics to help ensure compliance with regulatory requirements and to identify improvements that mitigate risks.
You can run evaluations in watsonx.governance to generate metrics with automated monitoring that provides actionable insights for your AI governance goals. You can use these metrics to help achieve the following goals:
Ensure compliance: Automatically track adherence to evolving regulations and organizational policies with alerts triggered when thresholds are breached.
Promote transparency: Generate detailed documentation to provide clear insights into model behavior, performance, and explainability of outcomes.
Mitigate risks: Detect and address issues like bias or accuracy drift through continuous evaluation and proactive risk assessments.
Protect privacy and security: Monitor for security vulnerabilities like exposure of personally identifiable information (PII) and enforce guardrails to prevent misuse of sensitive data.
The metrics that you can use to provide insights about your model performance are determined by the type of evaluations that you enable. Each type of evaluation generates different metrics that you can analyze to gain insights.
You can also use the ibm-watsonx-gov Python SDK to calculate metrics in a notebook runtime environment or offload them as Spark jobs against IBM Analytics Engine for evaluations.
The Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models. Some metrics might be available only with the Python SDK. For more information, see Metrics computation with the Python SDK.
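The SDK's own classes and methods are not reproduced here. As a rough sketch of the kind of per-record metric computation that an evaluation runs in a notebook, the following plain-Python example scores a small hypothetical batch with pandas; the column names and the metric are illustrative assumptions, not ibm-watsonx-gov API calls.

```python
# Illustrative only: plain pandas, not the ibm-watsonx-gov SDK API.
import pandas as pd

# Hypothetical evaluation records: model predictions alongside ground truth.
records = pd.DataFrame({
    "prediction": ["approved", "denied", "approved", "approved"],
    "label":      ["approved", "denied", "denied",   "approved"],
})

# A simple per-batch metric: the fraction of predictions that match the label.
accuracy = (records["prediction"] == records["label"]).mean()
print(f"accuracy = {accuracy:.2f}")  # 0.75 for this toy batch
```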
Drift evaluation metrics
Drift evaluation metrics can help you detect drops in accuracy and data consistency in your models to determine how well your model predicts outcomes over time. Watsonx.governance supports the following drift evaluation metrics for machine learning models:
Compares runtime transactions with the patterns of transactions in the training data to identify inconsistencies
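Purely as a toy illustration of the idea behind this comparison, and assuming simple numeric features and a range-based rule (neither of which is how the product's drift detection actually works), the following sketch flags runtime transactions whose feature values fall outside the ranges seen in the training data.

```python
import pandas as pd

# Hypothetical training and runtime feature data.
train = pd.DataFrame({"age": [25, 40, 58, 33], "income": [30_000, 72_000, 95_000, 54_000]})
runtime = pd.DataFrame({"age": [29, 71, 45], "income": [48_000, 20_000, 310_000]})

# Learn simple per-feature ranges from the training data.
lo, hi = train.min(), train.max()

# Flag runtime transactions with any feature value outside the training range.
inconsistent = ((runtime < lo) | (runtime > hi)).any(axis=1)
print(f"inconsistent transactions: {inconsistent.sum()} of {len(runtime)}")
```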
Drift v2 evaluation metrics
Drift v2 evaluation metrics can help you measure changes in your data over time to ensure consistent outcomes for your model. You can use these metrics to identify changes in your model output, the accuracy of your predictions, and the distribution
of your input data. Watsonx.governance supports the following drift v2 metrics:
Measures the change in distribution of the LLM predicted classes.
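The documentation does not specify which statistic quantifies the distribution change, so the following sketch uses total variation distance between a baseline and a current predicted-class distribution purely as an illustration.

```python
from collections import Counter

def class_distribution(predictions):
    """Normalize predicted-class counts into a probability distribution."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

baseline = class_distribution(["positive", "negative", "positive", "neutral"])
current  = class_distribution(["negative", "negative", "neutral", "negative"])

# Total variation distance: half the sum of absolute probability differences.
classes = set(baseline) | set(current)
tvd = 0.5 * sum(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in classes)
print(f"change in predicted-class distribution (TVD) = {tvd:.2f}")
```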
Fairness evaluation metrics
Fairness evaluation metrics can help you determine whether your model produces biased outcomes. You can use these metrics to identify when your model shows a tendency to provide favorable outcomes more often for one group over another. Watsonx.governance supports the following fairness evaluation metrics:
Compares the rate that monitored groups are selected to receive favorable outcomes to the rate that reference groups are selected to receive favorable outcomes.
Compares the percentage of favorable outcomes for monitored groups to reference groups.
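The two descriptions above amount to a ratio-style and a difference-style comparison of favorable-outcome rates. The following sketch computes both from a small hypothetical dataset; the group labels and the favorable outcome value are illustrative assumptions.

```python
import pandas as pd

# Hypothetical outcomes for a monitored group and a reference group.
data = pd.DataFrame({
    "group":   ["monitored", "monitored", "monitored", "reference", "reference", "reference"],
    "outcome": ["approved",  "denied",    "denied",    "approved",  "approved",  "denied"],
})

favorable = data["outcome"] == "approved"
monitored_rate = favorable[data["group"] == "monitored"].mean()
reference_rate = favorable[data["group"] == "reference"].mean()

# Ratio-style comparison of the favorable-outcome rates.
print(f"rate ratio      = {monitored_rate / reference_rate:.2f}")
# Difference-style comparison of the favorable-outcome percentages.
print(f"rate difference = {monitored_rate - reference_rate:.2f}")
```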
Generative AI quality evaluation metrics
Generative AI quality evaluation metrics can help you measure how well your foundation model performs tasks. Watsonx.governance supports the following generative AI quality evaluation metrics:
Table 4. Generative AI quality evaluation metric descriptions
Compares translated sentences from machine translations to sentences from reference translations to measure the similarity between reference texts and predictions
Evaluates the output of a model against SuperGLUE datasets by measuring the F1 score, precision, and recall against the model predictions and its ground truth data
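As a rough illustration of measuring F1 score, precision, and recall against ground truth data, the following sketch computes token-level overlap between a single prediction and its reference; the tokenization and aggregation that the product uses are not specified here.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str):
    """Token-overlap precision, recall, and F1 between a prediction and its reference."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))
```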
Watsonx.governance also supports the following categories of generative AI quality metrics:
Answer quality metrics
You can use answer quality metrics to evaluate the quality of model answers. Answer quality metrics are calculated with LLM-as-a-judge models; to calculate them, you can create a scoring function that calls those models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following answer quality metrics:
Measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output.
Measures how much shorter the summary is when compared to the input text by calculating the ratio between the number of words in the original text and the number of words in the foundation model output
Measures the extent that the foundation model output is generated from the model input by calculating the percentage of output text that is also in the input
Measures how extractive the summary in the foundation model output is from the model input by calculating the average of extractive fragments that closely resemble verbatim extractions from the original text
Measures the percentage of n-grams that repeat in the foundation model output by calculating the number of repeated n-grams and the total number of n-grams in the model output
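Two of the descriptions above reduce to simple word and n-gram counts. The following sketch computes a compression-style ratio and a repeated n-gram rate for one input/output pair, using whitespace tokenization as a simplifying assumption.

```python
from collections import Counter

def compression_ratio(source: str, summary: str) -> float:
    """Ratio of the word count of the original text to the word count of the model output."""
    return len(source.split()) / len(summary.split())

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of n-grams in the model output that occur more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)

source = "The quarterly report covers revenue, costs, hiring, and the outlook for next year in detail."
summary = "The report covers revenue, costs, and the outlook for next year."
print(f"compression ratio    = {compression_ratio(source, summary):.2f}")
print(f"repeated 3-gram rate = {repeated_ngram_ratio(summary):.2f}")
```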
Data safety metrics
You can use the following data safety metrics to identify whether your model's input or output contains harmful or sensitive information:
Table 7. Data safety evaluation metric descriptions
Measures whether your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing entity extraction model
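The evaluation itself relies on the Watson Natural Language Processing entity extraction model. Purely as a toy stand-in for the idea of scanning model input and output text for personally identifiable information, the following sketch checks for email and phone-number patterns with regular expressions.

```python
import re

# Toy patterns only; the actual evaluation relies on an NLP entity extraction model.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def contains_pii(text: str) -> dict:
    """Return which toy PII patterns match the given model input or output text."""
    return {name: bool(pattern.search(text)) for name, pattern in PII_PATTERNS.items()}

print(contains_pii("Contact me at jane.doe@example.com or 555-123-4567."))
```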
Multi-label/class metrics
You can use the following multi-label/class metrics to measure model performance for multi-label/multi-class predictions:
The ratio of the number of correct predictions over all classes to the number of true samples.
Retrieval quality metrics
You can use retrieval quality metrics to measure how well the retrieval system ranks relevant contexts. Retrieval quality metrics are calculated with LLM-as-a-judge models; to calculate them, you can create a scoring function that calls those models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following retrieval quality metrics:
Measures the quantity of relevant contexts out of the total number of contexts that are retrieved
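The retrieval quality description above is a simple proportion. The following sketch computes it from per-context relevance judgments; in practice those judgments come from an LLM-as-a-judge model rather than the hand-labeled list used here.

```python
# Hypothetical relevance judgments for the retrieved contexts (True = relevant).
retrieved_context_relevance = [True, False, True, True, False]

# Share of retrieved contexts that are relevant.
precision = sum(retrieved_context_relevance) / len(retrieved_context_relevance)
print(f"retrieval precision = {precision:.2f}")  # 0.60 for this toy example
```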
Model health monitor evaluation metrics
Model health monitor evaluation metrics can help you understand your model behavior and performance by determining how efficiently your model deployment processes your transactions. Model health evaluation metrics are enabled by default for
machine learning model evaluations in production and generative AI asset deployments. Watsonx.governance supports the following model health monitor evaluation metrics:
Table 10. Model health monitor evaluation metric descriptions
The total, average, minimum, maximum, and median payload size of the transaction records that your model deployment processes across scoring requests in kilobytes (KB)
Calculates the total, average, minimum, maximum, and median output token count across scoring requests during evaluations
Throughput and latency
Model health monitor evaluations calculate latency by tracking the time that it takes to process scoring requests and transaction records, measured in milliseconds (ms). Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.
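As an illustration of these calculations, the following sketch derives per-request latency statistics in milliseconds and throughput in requests per second from hypothetical request start and end timestamps.

```python
from statistics import mean, median

# Hypothetical (start, end) timestamps for scoring requests, in seconds.
requests = [(0.00, 0.12), (0.05, 0.31), (0.40, 0.52), (0.70, 1.05)]

# Latency per scoring request, in milliseconds.
latencies_ms = [(end - start) * 1000 for start, end in requests]
print(f"latency ms: avg={mean(latencies_ms):.0f} min={min(latencies_ms):.0f} "
      f"max={max(latencies_ms):.0f} median={median(latencies_ms):.0f}")

# Throughput: scoring requests processed per second of elapsed wall-clock time.
elapsed_s = max(end for _, end in requests) - min(start for start, _ in requests)
print(f"throughput = {len(requests) / elapsed_s:.2f} requests/second")
```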
The following metrics are calculated to measure throughput and latency during evaluations:
Table 12. Model health monitor throughput and latency metric descriptions