Removing harmful language from model input and output
AI guardrails removes potentially harmful content, such as hate speech, abuse, and profanity, from foundation model output and input.
Capabilities
AI guardrails is powered by AI that uses sentence classifiers to the input provided to a foundation model input and the output text generated by the model.
The sentence classifier breaks the model input and output text into sentences, and then reviews each sentence to find and flag harmful content. The classifier assesses each word, relationships among the words, and the context of the sentence to determine whether a sentence contains harmful language. The classifier then assigns a score that represents the likelihood that inappropriate content is present.
AI guardrails are enabled automatically when you inference natural-language foundation models.
When you use AI guardrails in the Prompt Lab and click Generate, the filter checks all model input and output text. Inappropriate text is handled in the following ways:
-
Input text that is flagged as inappropriate is not submitted to the foundation model. The following message is displayed instead of the model output:
[The input was rejected as inappropriate]
-
Model output text that is flagged as inappropriate is replaced with the following message:
[Potentially harmful text removed]
Restrictions
- AI guardrails can detect harmful content in English text only.
- You cannot apply AI guardrails with programmatic-language foundation models.
Ways to work
You can remove harmful content when you're working with foundation models with the following methods:
- From the Prompt Lab. For details, see Configuring AI guardrails in the Prompt Lab
- Programmatically with the following methods:
AI guardrails settings
You can configure the following filters to apply to the user input and model output and adjust the filter sensitivity, if applicable:
Hate, abuse, and profanity (HAP) filter
The HAP filter, which is also referred to as a HAP detector, is a sentence classifier created by fine tuning a large language model from the IBM Slate family of encoder-only natural language processing (NLP) models built by IBM Research.
Use the HAP filter to detect and flag the following types of language:
-
Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Hate speech shows an intent to hurt, humiliate, or insult the members of a group or to promote violence or social disorder.
-
Abusive language: Rude or hurtful language that is meant to bully, debase, or demean someone or something.
-
Profanity: Toxic words such as expletives, insults, or sexually explicit language.
You can use the HAP filter for user input and model output independently.
You can change the filter sensitivity by setting a threshold. The threshold represents the value that scores generated by the HAP classifier must reach for the content to be considered harmful. The score threshold ranges from 0.0 to 1.0.
A lower value, such as 0.1 or 0.2, is safer because the threshold is lower. Harmful content is more likely to be identified when a lower score can trigger the filter. However, the classifier might also be triggered when content is safe.
A value closer to 1, such as 0.8 or 0.9, is more risky because the score threshold is higher. When a higher score is required to trigger the filter, occurrences of harmful content might be missed. However, the content that is flagged as harmful is more likely to be harmful.
To disable AI guardrails, set the HAP threshold value to 1
.
Personal identifiable information (PII) filter
The PII filter uses a NLP AI model to identify and flag content. For the full list of entity types that are flagged, see Rule-based extraction for general entities.
Use the HAP filter to control whether personally identifiable information, such as phone numbers and email addresses, is filtered out from the user input and foundation model output. You can set HAP filters for user input and model output independently.
The PII filter threshold value is set to 0.8 and you cannot change the sensitivity of the filter.
Using a Granite Guardian model as a filter 
The Granite Guardian foundation model comes from the Granite family of models by IBM. The model is a significantly more powerful guardrail filter designed to deliver advanced protection against harmful content.
Use the Granite Guardian model as a filter to detect and flag the following types of language:
-
Social bias: Prejudiced statements based on identity or characteristics.
-
Jailbreaking: Attempts to manipulate AI to generate harmful, restricted, or inappropriate content.
-
Violence: Promotion of physical, mental, or sexual harm.
-
Profanity: Use of offensive language or insults.
-
Unethical behavior: Actions that violate moral or legal standards.
-
Harm engagement: Engagement or endorsement of harmful or unethical requests.
-
Evasiveness: Avoiding to engage without providing sufficient reason.
You can use the Granite Guardian model as a filter for user input only.
You can change the filter sensitivity by setting a threshold. The threshold represents the score value that content must reach to be considered harmful. The score threshold ranges from 0.0 to 1.0.
A lower value, such as 0.1 or 0.2, is safer because the threshold is lower. Harmful content is more likely to be identified when a lower score can trigger the filter. However, the classifier might also be triggered when content is safe.
A value closer to 1, such as 0.8 or 0.9, is more risky because the score threshold is higher. When a higher score is required to trigger the filter, occurrences of harmful content might be missed. However, the content that is flagged as harmful is more likely to be harmful.
To disable AI guardrails, set the Granite Guardian threshold value to 1
.
Configuring AI guardrails in the Prompt Lab
To remove harmful content when you're working with foundation models in the Prompt Lab, set the AI guardrails switcher to On.
The AI guardrails feature is enabled automatically for all natural language foundation models in English.
To configure AI guardrails in the Prompt Lab, complete the following steps:
-
With AI guardrails enabled, Click the AI guardrails settings icon
.
-
You can configure different filters to apply to the user input and model output and adjust the filter sensitivity, if applicable.
-
HAP filter
To disable AI guardrails, set the HAP slider to
1
. To change the sensitivity of the guardrails, move the HAP sliders. -
PII filter
To enable the PII filter, set the PII switcher to On.
-
Granite Guardian model as a filter
Granite Guardian moderation is disabled by default. To change the sensitivity of the guardrails, move the Granite Guardian sliders.
Experiment with adjusting the sliders to find the best settings for your needs.
-
-
Click Save.
Configuring AI guardrails programmatically
You can set AI guardrails programmatically to moderate the input text provided to a foundation model and the output generated by the model in multiple ways.
REST API
You can use the following watsonx.ai API endpoints to configure and apply AI guardrails to natural language input and output text:
- When you inference a foundation model by using the text generation API, you can use the
moderations
field to apply filters to the foundation model input and output. For more information, see Text generation in the watsonx.ai API reference documentation. - When you verify content by using the text detection API, you can use the
detectors
field to apply filters to the text. For more information, see Text detection in the watsonx.ai API reference documentation.
Python
You can use the watsonx.ai Python SDK to configure and apply AI guardrails to natural language input and output text in the following ways:
-
Adjust the AI guardrails filters with the Python library when you inference the foundation model by using the text generation API. For details, see Inferencing a foundation model programmatically (Python).
-
Adjust the AI guardrails filters with the Python library when you inference the foundation model by using the text detection API. For more information, see the Guardian class of the watsonx.ai Python library.
The following code example shows you how to configure and use the filters with the text detection API:
from ibm_watsonx_ai import APIClient, Credentials from ibm_watsonx_ai.foundation_models.moderations import Guardian credentials = Credentials( url = "https://{region}.ml.cloud.ibm.com", api_key ="{my-IBM-Cloud-API-key}" ) api_client = APIClient(credentials, space_id="{my-space-ID}") detectors = { "granite_guardian": {"threshold": 0.4}, "hap": {"threshold": 0.4}, "pii": {}, } guardian = Guardian( api_client=api_client, # required detectors=detectors # required )
To use the custom filter with the Python library, include the following parameter in the text detection request:
text = "I would like to say some `Indecent words`." response = guardian.detect( text=text, # required detectors=detectors # optional )
For more information, see watsonx.ai Python SDK.
Learn more
- Techniques for avoiding undesirable output
- watsonx.ai API reference documentation
- AI risk atlas
- Security and privacy
Parent topic: Building prompts