Text extraction parameters

Last updated: Jul 30, 2025

When you submit a text extraction request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text extraction operation.

In the REST API request body, set the text extraction parameters that meet your requirements:

Format in which to store the extracted text
Quality and speed of text extraction
Language of the input text
Whether to include text from images in the extracted output
Whether to include key-value pairs in the extracted output

For details about the different parameters you can set to customize your text extraction REST API request, see the watsonx.ai API reference documentation.

Specifying the output format

By default, the extracted text is written in plain text. If you want the extracted text to be written in another format, such as Markdown, specify the following parameter in the API request body:

"parameters": {
  "requested_outputs": [
    "md"
  ]
}

The following table describes the output formats that the text extraction process generates when you specify the requested_outputs parameter in your API request:

Requested output formats in the text extraction API
Requested output	Generated file type	Description
`md`	Markdown	Extracted information is serialized in Markdown format. Data structures such as section titles, tables, and paragraphs are represented using Markdown tags. The result does not contain key-value pair data.
`html`	HTML	Extracted information is serialized in HTML format. Data structures such as section titles, tables, and paragraphs are represented using HTML tags. The result does not contain key-value pair data.
`plain_text`	Plain text	Extracted information is serialized in plain text format. The result only contains unstructured text. The result does not contain tables, section titles, or key-value pair data.
`assembly`	JSON	Extract text into a JSON format. The result contains all unstructured text and data structures such as tables, key-value pairs, and visual bounding box information.
`page_images`	PNG	Extract each page of the document into a separate image.
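
As a minimal sketch of how a client might assemble this part of the request body, the following Python helper checks the requested formats against the values in the table above. The function name build_output_parameters is hypothetical and is not part of any watsonx.ai SDK; only the requested_outputs key and its values come from this page.

```python
import json

# Output format identifiers listed in the table above.
SUPPORTED_OUTPUTS = {"md", "html", "plain_text", "assembly", "page_images"}


def build_output_parameters(formats):
    """Return a `parameters` fragment for the requested output formats.

    Hypothetical helper: rejects values that are not in the table above.
    """
    unknown = [f for f in formats if f not in SUPPORTED_OUTPUTS]
    if unknown:
        raise ValueError(f"Unsupported output format(s): {unknown}")
    # Drop duplicates while preserving the caller's order.
    unique = list(dict.fromkeys(formats))
    return {"parameters": {"requested_outputs": unique}}


if __name__ == "__main__":
    print(json.dumps(build_output_parameters(["md", "assembly"]), indent=2))
```

Validating on the client side is a design choice in this sketch; the API itself remains the authority on which values it accepts.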

Processing mode

You can control the speed at which your text extraction request is processed by setting the mode parameter in your API request.

"parameters": {
  "mode": "standard"
}

The high quality processing mode preserves all data structures in your document but may take longer to process than the standard mode. In the standard mode, the extraction request completes faster but generates lower quality output that may lack details.

For details about the different processing modes, see the watsonx.ai API reference documentation.
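
The trade-off described above could be captured in a small helper such as the following sketch. The value "standard" appears in the example on this page; the value "high_quality" is an assumption about how the high quality mode is spelled in the API, so confirm it in the API reference before you rely on it. The function name build_mode_parameters is hypothetical.

```python
import json


def build_mode_parameters(preserve_structure):
    """Return a `parameters` fragment that selects a processing mode.

    "standard" is taken from the example on this page.
    "high_quality" is an ASSUMED identifier for the high quality mode.
    """
    mode = "high_quality" if preserve_structure else "standard"
    return {"parameters": {"mode": mode}}


if __name__ == "__main__":
    # Favor speed over completeness of the extracted structures.
    print(json.dumps(build_mode_parameters(preserve_structure=False)))
```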

Supported languages

If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.

"parameters": {
  "languages": [
    "de"
  ]
}

If the document has a mix of languages, list each language separately.

Note: You cannot extract text from a mixed-language document when the languages do not share a common script. However, you can use documents with a mix of English and one other language in any script.

For example, you can extract text from images in a document with a mix of English and French text because both languages use the Latin script. However, you cannot extract text from images in a document with a mix of Japanese and French text.

The language code you specify differs based on whether your document contains machine-printed text or handwriting.

Supported handwritten languages

If your document contains text in English handwriting, use the en_hw language code in your API request body.

Supported machine-printed languages

The following table provides details about the languages supported by the text extraction API for printed text recognition:

Note: If your document language does not have an ISO 639 language code listed, use the API script code.

Machine-printed languages supported in the text extraction API
Language	ISO 639 language code	API script code	Script
Acehnese	‐	`latn`	Latin
Afrikaans	`af`	`latn`	Latin
Albanian	`sq`	`latn`	Latin
Araucanian/Mapuche	‐	`latn`	Latin
Awadhi	‐	`deva`	Devanagari
Aymara	`ay`	`latn`	Latin
Balinese	‐	`latn`	Latin
Baso Minangkabau	‐	`latn`	Latin
Basque	`eu`	`latn`	Latin
Belarusian	`be`	`cyrl`	Cyrillic
Bemba	‐	`latn`	Latin
Bikol	‐	`latn`	Latin
Bislama	`bi`	`latn`	Latin
Bhojpuri	‐	`deva`	Devanagari
Bulgarian	`bg`	`cyrl`	Cyrillic
Catalan	`ca`	`latn`	Latin
Cebuano	‐	`latn`	Latin
Chechen	‐	`cyrl`	Cyrillic
Chinese (Simplified)	`zh_cn`	`cjk`	Han (Simplified)
Chinese (Traditional)	`zh_tw`	`cjk`	Han (Traditional)
Choctaw	‐	`latn`	Latin
Cree	`cr`	`latn`	Latin
Dakota	‐	`latn`	Latin
Danish	`da`	`latn`	Latin
Dogri	‐	`deva`	Devanagari
Dutch	`nl`	`latn`	Latin
English	`en`	`latn`	Latin
Estonian	`et`	`latn`	Latin
Fijian	`fj`	`latn`	Latin
Filipino	`fil`	`latn`	Latin
Finnish	`fi`	`latn`	Latin
French	`fr`	`latn`	Latin
Galician	`gl`	`latn`	Latin
Gayo	‐	`latn`	Latin
German	`de`	`latn`	Latin
Gilbertese	‐	`latn`	Latin
Greek	`el`	`el`	Greek
Haitian Creole	`ht`	`latn`	Latin
Hebrew	`he`	`he`	Hebrew
Hiligaynon	‐	`latn`	Latin
Hindi	`hi`	`deva`	Devanagari
Iban	‐	`latn`	Latin
Iloko	‐	`latn`	Latin
Indonesian	`id`	`latn`	Latin
Irish	`ga`	`latn`	Latin
Italian	`it`	`it`	Latin
Japanese	`ja`	`cjk`	Japanese
Javanese	`jv`	`latn`	Latin
Kachin	‐	`latn`	Latin
Kalaallisut	`kl`	`latn`	Latin
Kanienʼkéha	‐	`latn`	Latin
Khasi	‐	`latn`	Latin
Kinyarwanda	`rw`	`latn`	Latin
Konkani	‐	`deva`	Devanagari
Kongo	`kg`	`latn`	Latin
Korean	`ko`	`cjk`	Korean
Kosraean	‐	`latn`	Latin
Kuanyama	`kj`	`latn`	Latin
Latin	`la`	`latn`	Latin
Lozi	‐	`latn`	Latin
Low German	‐	`latn`	Latin
Luo	‐	`latn`	Latin
Malagasy	`mg`	`latn`	Latin
Maithili	‐	`deva`	Devanagari
Manx	`gv`	`latn`	Latin
Marathi	`mr`	`deva`	Devanagari
Middle English	‐	`latn`	Latin
Mittelhochdeutsch	‐	`latn`	Latin
Macedonian	`mk`	`cyrl`	Cyrillic
Ndonga	`ng`	`latn`	Latin
Nepali	`ne`	`deva`	Devanagari
North Ndebele	`nd`	`latn`	Latin
Norwegian	`no`	`no`	Latin
Nyankole	‐	`latn`	Latin
Occitan	`oc`	`latn`	Latin
Ojibwa	`oj`	`latn`	Latin
Old English	‐	`latn`	Latin
Old French	‐	`latn`	Latin
Old High German	‐	`latn`	Latin
Old Norse	‐	`latn`	Latin
Old Provençal	‐	`latn`	Latin
Pampanga	‐	`latn`	Latin
Pangasinan	‐	`latn`	Latin
Papiamento	‐	`latn`	Latin
Polish	`pl`	`latn`	Latin
Portuguese	`pt`	`pt`	Latin
Quechua	`qu`	`latn`	Latin
Romansh	`rm`	`latn`	Latin
Rundi	`rn`	`latn`	Latin
Russian	`ru`	`cyrl`	Cyrillic
Sango	`sg`	`latn`	Latin
Sanskrit	`sa`	`deva`	Devanagari
Scots	‐	`latn`	Latin
Serbian	`sr`	`cyrl`	Cyrillic
Shona	`sn`	`latn`	Latin
Spanish	`es`	`es`	Latin
Sundanese	`su`	`latn`	Latin
Swahili	`sw`	`latn`	Latin
Swati	`ss`	`latn`	Latin
Swedish	`sv`	`sv`	Latin
Tamil	`ta`	`deva`	Tamil
Telugu	`te`	`deva`	Telugu
Tsonga	`ts`	`latn`	Latin
Tswana	`tn`	`latn`	Latin
Ukrainian	`uk`	`cyrl`	Cyrillic
Uzbek	`uz`	`cyrl`	Cyrillic
Xhosa	`xh`	`latn`	Latin
Zulu	`zu`	`latn`	Latin
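
The fallback rule in the note above (use the API script code when no ISO 639 code is listed) can be sketched in Python as follows. The dictionary holds only a small excerpt of the table, and the names LANGUAGE_CODES, language_code, and build_language_parameters are hypothetical helpers, not part of the API.

```python
import json

# Excerpt of the table above: language -> (ISO 639 code or None, API script code).
LANGUAGE_CODES = {
    "English": ("en", "latn"),
    "German": ("de", "latn"),
    "Hindi": ("hi", "deva"),
    "Russian": ("ru", "cyrl"),
    "Cebuano": (None, "latn"),
    "Chechen": (None, "cyrl"),
}


def language_code(language):
    """Return the ISO 639 code if one is listed, else the API script code."""
    iso_code, script_code = LANGUAGE_CODES[language]
    return iso_code if iso_code is not None else script_code


def build_language_parameters(languages):
    """Return a `parameters` fragment that lists each language once."""
    codes = []
    for language in languages:
        code = language_code(language)
        if code not in codes:
            codes.append(code)
    return {"parameters": {"languages": codes}}


if __name__ == "__main__":
    print(json.dumps(build_language_parameters(["English", "Chechen"])))
```

The sketch does not check whether the listed languages share a script; that restriction still applies to the document you submit.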

Extracting text from images

You can specify how to process text in images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:

"parameters": {
  "ocr_mode": "enabled"
}

For details about the different OCR modes, see the watsonx.ai API reference documentation.

You can also configure how to process images embedded in your document and convert them to Markdown and JSON formats.

The embedded image is the area on a page of the document that represents only the picture without including portions of the page that contain text or tables. Text and tables in the original document are processed with OCR. The embedded images extraction mode is used to specify how to serialize images in the document and preserve them in the extracted output.

Based on the embedded images extraction mode you specify, you can choose how embedded images are represented in the output:

Whether to include images in the extracted output. If images are included, they are stored in the embedded_images_assembly folder as .png files.
Whether generic placeholder text or the text extracted by OCR directly from the image appears in the Markdown and JSON output formats.
Whether the image is verbalized, that is, described in natural language. For example, an image of a cat may be verbalized as "The image displays a cat resting on the floor."

To extract embedded images including text that describes the images, specify the following parameter in the API request body:

"parameters": {
  "create_embedded_images": "enabled_verbalization"
}

Images extracted in a JSON output format are represented in the Picture object. Based on the embedded images mode you specify, the following attributes in the JSON object are used to store the image details:

text : Stores a string that contains the text extracted directly from the image.
verbalization : Stores a string that contains the textual description of the image.
children_ids : Stores a list of token IDs. Each word in the text related to an image is represented as a token.

For details about the JSON output schema, see Text extraction JSON schema.
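
As an illustration of how a client might consume these attributes, the following Python sketch picks the most descriptive string available for one picture entry. It assumes that the entry has already been parsed from the JSON output into a plain dictionary; the sample entries and the function name describe_picture are hypothetical, and real entries contain more fields than the three shown here.

```python
def describe_picture(picture):
    """Return the best available text for one picture entry, or None.

    Prefers the natural-language description, then the text that was
    extracted directly from the image.
    """
    for key in ("verbalization", "text"):
        value = picture.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


if __name__ == "__main__":
    # Hypothetical picture entries for illustration only.
    samples = [
        {"text": "", "verbalization": "The image displays a cat resting on the floor.", "children_ids": ["t1", "t2"]},
        {"text": "Total: 42", "children_ids": ["t3"]},
        {"text": "", "children_ids": []},
    ]
    for sample in samples:
        print(describe_picture(sample))
```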

The following table provides details about the different modes you can use in your API request to extract embedded images:

Embedded images extraction modes in the text extraction API
Mode	Usage	Image (in bytes) in output	Markdown output details	JSON output details
`disabled`	Suited for an application that does not need to include images in the output. OCR processes tables and other data structures in the document.	No	None	None
`enabled_placeholder`	Suited for an application that needs to process images, but does not require image descriptions and uses a custom image verbalizer to generate image descriptions.	✓	Link to image location	• Image in the `pictures` structure • `picture.text` is empty • List of token IDs that represent generic placeholder text in `picture.children_ids`
`enabled_text`	Suited for an application that needs to process images, but does not require image descriptions and uses a custom image verbalizer to generate image descriptions.	✓	Text is extracted from the image	• Image in the `pictures` structure • Text extracted directly from the image in `picture.text` • List of token IDs that represent text extracted from the image in `picture.children_ids`
`enabled_verbalization`	Suited for an application that uses image descriptions to implement image search.	✓	• Link to image location • Textual description of the image	• Image in the `pictures` structure • Textual description of the image in `picture.verbalization` only if the image was verbalized in the original document • List of token IDs that represent the textual description of the image
`enabled_verbalization_all`	Suited for an application that uses image descriptions to implement image search.	✓	• Link to image location • Textual description of the image	• Image in the `pictures` structure • Textual description of the image in `picture.verbalization` only if the image was verbalized in the original document • List of token IDs that represent the textual description of the image

Extracting text in key-value pairs

You can choose to extract text as key-value pairs from documents that contain domain-specific structured data. The extracted text is stored in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is extracted by using a general-purpose foundation model or a model that is tuned for specific document formats.

The following restrictions apply when you use the key-value pair extraction capability:

Key-value pair data extraction is only supported for English language documents.
The result of the key-value pair extraction is only available in the assembly output format. Key-value pairs are not extracted in the html, md, or plain_text output formats.

Based on the contents of your input document, you can extract key-value pair data with one of the following methods:

Generic key-value pair extraction: The generic extraction process identifies and extracts all key-value pairs in a document. This method is useful for extracting labeled information without needing to know details about specific fields in advance.
Schema-based (Fixed) key-value pair extraction: The schema-based process targets specific, pre-defined fields in documents by using built-in schemas for common document types like invoices, utility bills, passports, and more. Every page is classified into one of the supported schema types. Based on the classification, text is extracted into the key-value pair format defined in the schema for the specific document type. By classifying the document first, this method increases accuracy for known document types without requiring dedicated model training.

For example, if you want to extract text as key-value pair data with a general purpose foundation model, specify the following parameter in the API request body:

"parameters": {
  "kvp_mode": "generic_with_semantic"
}

If you set the enable_generic_kvp parameter to True, a value may be extracted twice when the model uses both the generic and schema-based extraction methods. If you only want to extract generic key-value pair data, set the enable_generic_kvp parameter to False.

If you do not specify the kvp_mode parameter in your text extraction API request, no key-value pairs are extracted from your document.
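
Because key-value pair results are only available in the assembly output format, a client might make sure that assembly is requested whenever kvp_mode is set. The following Python sketch shows one way to do that. The function name build_kvp_parameters is hypothetical; the kvp_mode and requested_outputs keys and their values come from this page.

```python
import json


def build_kvp_parameters(kvp_mode, requested_outputs=None):
    """Return a `parameters` fragment for key-value pair extraction.

    Adds "assembly" to the requested outputs if it is missing, because
    key-value pairs are only written to the assembly output format.
    """
    outputs = list(requested_outputs or [])
    if "assembly" not in outputs:
        outputs.append("assembly")
    return {"parameters": {"kvp_mode": kvp_mode, "requested_outputs": outputs}}


if __name__ == "__main__":
    # Request Markdown for reading and assembly for the key-value pairs.
    print(json.dumps(build_kvp_parameters("generic_with_semantic", ["md"]), indent=2))
```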

Key-value pairs extraction modes

You can specify one of the following modes in your API request to extract key-value pair data from your document:

invoice

Extract text from an invoice with a specialized model in a key-value pair format. The model is trained with datasets that contain various invoices.

The following attributes are extracted from an invoice in the invoice mode:

Invoice Date
Invoice Total
Invoice Number
Bill To Name
Bill To Address
Vendor Name
Vendor Address
Payment Terms
Payment Due Date
PO Number
Ship To Name
Ship To Address
Shipping Amount
Tax Amount
Sub Total
Tax Type
Tax Rate
Bank name
Bank Account Number

ubill

Extract text from a utility bill with a specialized model in a key-value pair format. The model is trained with datasets that contain various utility bills.

The following attributes are extracted from a utility bill in the ubill mode:

Account Number
Amount Due
Company Name
Company Address
Customer Name
Customer Address
Due Date
Payment Received
Previous Balance
Service Address
Statement Date

generic_with_semantic

Extract generic labeled data and domain-specific data with a general purpose model into a key-value pair format. If pages in your document can be classified into one of several pre-defined schemas, domain-specific data is stored in the fields defined in that schema. For pages that do not fit into one of the pre-defined schemas, the key-value pairs are extracted in a generic format without specific labels. The pixtral-12b model is used to generate the generic and schema-based key-value pairs in this mode.

Restriction:

The generic_with_semantic mode setting is not available in the Toronto and Sydney regions.

The API extracts text from the following document types into pre-defined schemas in the generic_with_semantic mode:

If your document contains unique structured content, you can provide a custom schema that defines specific data and unique identifiers. When you specify a custom schema, the text extraction process overrides the pre-defined common document schemas and only uses the schema you provide.

You can provide a custom schema for key-value pair extraction by specifying the semantic_config parameter in your API request. For more information about how to configure custom schema parameters, see Creating custom schemas for key-value pair extraction.
