Removing harmful language from model input and output

Last updated: Jul 31, 2025

AI guardrails removes potentially harmful content, such as hate speech, abuse, and profanity, from foundation model output and input.

Capabilities

AI guardrails is powered by AI that uses sentence classifiers to the input provided to a foundation model input and the output text generated by the model.

The sentence classifier breaks the model input and output text into sentences, and then reviews each sentence to find and flag harmful content. The classifier assesses each word, relationships among the words, and the context of the sentence to determine whether a sentence contains harmful language. The classifier then assigns a score that represents the likelihood that inappropriate content is present.

AI guardrails are enabled automatically when you inference natural-language foundation models.

When you use AI guardrails in the Prompt Lab and click Generate, the filter checks all model input and output text. Inappropriate text is handled in the following ways:

Input text that is flagged as inappropriate is not submitted to the foundation model. The following message is displayed instead of the model output:

[The input was rejected as inappropriate]
Model output text that is flagged as inappropriate is replaced with the following message:

[Potentially harmful text removed]

Restrictions

AI guardrails can detect harmful content in English text only.
You cannot apply AI guardrails with programmatic-language foundation models.

Ways to work

You can remove harmful content when you're working with foundation models with the following methods:

From the Prompt Lab. For details, see Configuring AI guardrails in the Prompt Lab
Programmatically with the following methods:
- REST API
- Python

AI guardrails settings

You can configure the following filters to apply to the user input and model output and adjust the filter sensitivity, if applicable:

Hate, abuse, and profanity (HAP) filter

The HAP filter, which is also referred to as a HAP detector, is a sentence classifier created by fine tuning a large language model from the IBM Slate family of encoder-only natural language processing (NLP) models built by IBM Research.

Use the HAP filter to detect and flag the following types of language:

Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Hate speech shows an intent to hurt, humiliate, or insult the members of a group or to promote violence or social disorder.
Abusive language: Rude or hurtful language that is meant to bully, debase, or demean someone or something.
Profanity: Toxic words such as expletives, insults, or sexually explicit language.

You can use the HAP filter for user input and model output independently.

You can change the filter sensitivity by setting a threshold. The threshold represents the value that scores generated by the HAP classifier must reach for the content to be considered harmful. The score threshold ranges from 0.0 to 1.0.

A lower value, such as 0.1 or 0.2, is safer because the threshold is lower. Harmful content is more likely to be identified when a lower score can trigger the filter. However, the classifier might also be triggered when content is safe.

A value closer to 1, such as 0.8 or 0.9, is more risky because the score threshold is higher. When a higher score is required to trigger the filter, occurrences of harmful content might be missed. However, the content that is flagged as harmful is more likely to be harmful.

To disable AI guardrails, set the HAP threshold value to 1.

Personal identifiable information (PII) filter

The PII filter uses a NLP AI model to identify and flag content. For the full list of entity types that are flagged, see Rule-based extraction for general entities.

Use the PII filter to control whether personally identifiable information, such as phone numbers and email addresses, is filtered out from the user input and foundation model output. You can set PII filters for user input and model output independently.

The PII filter threshold value is set to 0.8 and you cannot change the sensitivity of the filter.

Using a Granite Guardian model as a filter

The Granite Guardian foundation model comes from the Granite family of models by IBM. The model is a significantly more powerful guardrail filter designed to deliver advanced protection against harmful content.

Note:

The Granite Guardian model as a filter is currently in beta, and therefore does not incur any charges. Granite Guardian moderation is disabled by default.

Use the Granite Guardian model as a filter to detect and flag the following types of language:

Social bias: Prejudiced statements based on identity or characteristics.
Jailbreaking: Attempts to manipulate AI to generate harmful, restricted, or inappropriate content.
Violence: Promotion of physical, mental, or sexual harm.
Profanity: Use of offensive language or insults.
Unethical behavior: Actions that violate moral or legal standards.
Harm engagement: Engagement or endorsement of harmful or unethical requests.
Evasiveness: Avoiding to engage without providing sufficient reason.

Important: The Granite Guardian filter uses the complete chat history to understand if a prompt is unsafe. If you include a message that is flagged as a 'risk', all subsequent messages fail the safety check.

You can use the Granite Guardian model as a filter for user input only.

You can change the filter sensitivity by setting a threshold. The threshold represents the score value that content must reach to be considered harmful. The score threshold ranges from 0.0 to 1.0.

To disable AI guardrails, set the Granite Guardian threshold value to 1.

Configuring AI guardrails in the Prompt Lab

To remove harmful content when you're working with foundation models in the Prompt Lab, set the AI guardrails switcher to On.

The AI guardrails feature is enabled automatically for all natural language foundation models in English.

To configure AI guardrails in the Prompt Lab, complete the following steps:

With AI guardrails enabled, Click the AI guardrails settings icon .
You can configure different filters to apply to the user input and model output and adjust the filter sensitivity, if applicable.
- HAP filter
  
  To disable AI guardrails, set the HAP slider to 1. To change the sensitivity of the guardrails, move the HAP sliders.
- PII filter
  
  To enable the PII filter, set the PII switcher to On.
- Granite Guardian model as a filter
  
  Granite Guardian moderation is disabled by default. To change the sensitivity of the guardrails, move the Granite Guardian sliders.
Experiment with adjusting the sliders to find the best settings for your needs.
Click Save.

Configuring AI guardrails programmatically

You can set AI guardrails programmatically to moderate the input text provided to a foundation model and the output generated by the model in multiple ways.

REST API

Watsonx.ai

You can use the following watsonx.ai API endpoints to configure and apply AI guardrails to natural language input and output text:

When you inference a foundation model by using the text generation API, you can use the moderations field to apply filters to the foundation model input and output. For more information, see Text generation in the watsonx.ai API reference documentation.
When you verify content by using the text detection API, you can use the detectors field to apply filters to the text. For more information, see Text detection in the watsonx.ai API reference documentation.

Watsonx.governance

For watsonx.governance, you can use the following watsonx.governance API endpoints to configure and apply AI guardrails to natural language input and output text:

When you verify content by using the text detection API, you can use the detectors field to apply filters to the text. For more information, see Text detection in the watsonx.gov API reference documentation.

Python

Watsonx.ai

You can use the watsonx.ai Python SDK to configure and apply AI guardrails to natural language input and output text in the following ways:

Adjust the AI guardrails filters with the Python library when you inference the foundation model by using the text generation API. For details, see Inferencing a foundation model programmatically (Python).

Adjust the AI guardrails filters with the Python library when you inference the foundation model by using the text detection API. For more information, see the Guardian class of the watsonx.ai Python library.

The following code example shows you how to configure and use the filters with the text detection API:

from ibm_watsonx_ai import APIClient, Credentials
from ibm_watsonx_ai.foundation_models.moderations import Guardian

credentials = Credentials(
    url = "https://{region}.ml.cloud.ibm.com",
    api_key ="{my-IBM-Cloud-API-key}"
)
api_client = APIClient(credentials, space_id="{my-space-ID}")

detectors = {
    "granite_guardian": {"threshold": 0.4},
    "hap": {"threshold": 0.4},
    "pii": {},
}

guardian = Guardian(
    api_client=api_client,  # required
    detectors=detectors  # required
)

To use the custom filter with the Python library, include the following parameter in the text detection request:

text = "I would like to say some `Indecent words`."

response = guardian.detect(
    text=text,   # required
    detectors=detectors # optional
)

For more information, see watsonx.ai Python SDK.

Watsonx.governance

For watsonx.governance, you can use the watsonx.governance Python SDK to configure and apply AI guardrails to natural language input and output text in the following ways:

Adjust the AI guardrails filters with the Python library when you inference the foundation model by using the text detection API. For more information, see the SDK documentation of the watsonx.governance Python library.

The following code example shows you how to configure and use the filters with the text detection API:
```
import os

from ibm_watsonx_gov.evaluators import MetricsEvaluator
from ibm_watsonx_gov.metrics import (HAPMetric, PIIMetric, HarmMetric)

os.environ["WATSONX_APIKEY"] = "<IBM_CLOUD_APIKEY>"

evaluator = MetricsEvaluator()
text = "How can I steal someone's credit card information and use it?"

result = evaluator.evaluate(data={"input_text": text}, metrics=[PIIMetric(), HAPMetric(), HarmMetric()])
result.to_df()
```
For more information about how to use the Python SDK to invoke the AI guardrails, see Notebook on invoking Guardrails with IBM watsonx.governance.

Learn more

Parent topic: Building prompts

Was the topic helpful?

0/1000