Extract text to convert high-quality business documents into a simpler file format that AI models can use, or to find and isolate key pieces of information in documents such as contracts.
Simplifying your business documents by converting them into a text-based format is especially useful for retrieval-augmented generation tasks where you want to find information that is relevant to a user query and include it with the input to
a foundation model. Including accurate contextual information in model input helps the foundation model to incorporate factual and up-to-date information in the model output. For more information, see Retrieval-augmented generation (RAG).
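The retrieval step described above can be sketched in a few lines. This is only a minimal illustration, assuming that the extracted text is already split into passages and that simple word overlap is good enough to rank them; real retrieval-augmented generation pipelines typically use vector search instead. The function names and sample passages are hypothetical and are not part of the text extraction API.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def most_relevant_passage(query: str, passages: list[str]) -> str:
    """Return the passage that shares the most words with the query."""
    query_tokens = tokenize(query)
    return max(passages, key=lambda p: len(query_tokens & tokenize(p)))

def build_prompt(query: str, passages: list[str]) -> str:
    """Prepend the best-matching passage to the user query as context."""
    context = most_relevant_passage(query, passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical passages, standing in for text extracted from a contract.
extracted_passages = [
    "The agreement begins on 1 March and renews every year.",
    "Either party may terminate the contract with 60 days written notice.",
    "Invoices are payable within 30 days of receipt.",
]

query = "How much notice is needed to terminate the contract?"
prompt = build_prompt(query, extracted_passages)
print(prompt)
```

The resulting prompt places the termination clause ahead of the question, so a foundation model can ground its answer in the extracted text.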
Capabilities
The document understanding technology uses the following methods to extract text:
Optical character recognition
Optical character recognition (OCR) extracts text from images, scanned documents, and tables. It is useful for preserving information that is depicted in images or diagrams, or in text that is embedded in files such as scanned PDFs. Although
optical character recognition can extract text from noisy images, the quality of the image files must meet the minimum requirement of 80 DPI (dots per inch).
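As a rough pre-check before you submit a scan, you can estimate its resolution from the pixel dimensions and the physical page size. The sketch below assumes that you know the page size in inches; how the service itself measures image quality is not stated here, so treat the helper names and the check as illustrative only.

```python
MIN_DPI = 80  # minimum image quality stated for optical character recognition

def effective_dpi(pixels: int, inches: float) -> float:
    """Return the dots per inch for one dimension of a scanned page."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches

def meets_ocr_minimum(width_px: int, height_px: int,
                      width_in: float, height_in: float) -> bool:
    """Check that both dimensions reach the minimum DPI."""
    lowest = min(effective_dpi(width_px, width_in),
                 effective_dpi(height_px, height_in))
    return lowest >= MIN_DPI

# A US Letter page (8.5 x 11 inches) scanned at 850 x 1100 pixels is 100 DPI.
print(meets_ocr_minimum(850, 1100, 8.5, 11.0))  # True
# The same page at 600 x 776 pixels is roughly 70 DPI.
print(meets_ocr_minimum(600, 776, 8.5, 11.0))   # False
```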
Document structure identification
The text extraction API processes document content from various data structures including tables, section titles, bulleted lists, paragraphs, and footnotes. The API also identifies and removes commonly used content such as headers and footers.
Key-value pair extraction
Use key-value pair extraction to process documents that contain generic or domain-specific structured data, like invoices, utility bills, and more. The extraction mode classifies documents based on the document type. The extracted text is
stored in a data structure called a schema where each piece of data (the value) is associated with a unique identifier (the key). The mode uses a pre-defined schema or a custom schema that you define. Key-value pairs are extracted with large
language models (LLMs) and advanced vision-language processing.
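To make the key-value idea concrete, the following sketch models a schema and an extraction result as plain Python dictionaries. The field names, descriptions, and values are hypothetical, and the layout is not the schema format that the API expects; it only illustrates how each value is tied to a unique key.

```python
# Hypothetical custom schema for an invoice: each key is a unique identifier,
# and the description says what value belongs to that key.
invoice_schema = {
    "invoice_number": "Identifier printed on the invoice",
    "invoice_date": "Date the invoice was issued",
    "total_amount": "Total amount due, including tax",
}

# Hypothetical extraction result: the values that were found for each key.
extracted = {
    "invoice_number": "INV-0042",
    "invoice_date": "2024-05-01",
}

def missing_keys(schema: dict[str, str], result: dict[str, str]) -> list[str]:
    """Return schema keys that have no extracted value, in schema order."""
    return [key for key in schema if not result.get(key)]

print(missing_keys(invoice_schema, extracted))  # ['total_amount']
```

Comparing a result against its schema in this way is one simple method for spotting fields that the extraction did not find.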
Requirements
If you signed up for watsonx.ai and you have a sandbox project, all requirements are met and you're ready to use the text extraction service.
Otherwise, you must meet the following requirements:
You must have a project.
The project must have an associated watsonx.ai Runtime service instance.
Required permissions
To run a text extraction job, you must have the Admin or Editor role in a project.
Text extraction is available with paid plans only. Billing is based on the number of pages that are processed. For details, see Billing details for generative AI assets.
Required credentials
Create a task credential. A task credential is an API key that is used to authenticate long-running jobs that are started by steps you perform in the text extraction procedure. You do not need to pass the task credential in the API request.
For details, see Creating task credentials.
Supported input file types
You can extract text from documents in different languages, or from a document that has a mix of multiple languages. Extract text from the following file types:
PDF
GIF
JPG
PNG
TIFF
BMP
DOC
DOCX
HTML
JFIF
PPT
PPTX
Supported output file types
You can store the extracted text in the following formats:
JSON
Markdown
HTML
TXT
For details about the contents of the extracted result in each output file type, see Specifying the output format.
Restrictions
You can extract text from specific input file types and store the extracted output in certain file types. Not every input file type can be extracted into every supported output format. The following table provides details about which input
file type is compatible with the various output formats:
Table: Input file type and extracted output format compatibility for the text extraction API

| Input file type | Compatible output file formats |
|---|---|
| Programmatic PDF | All formats |
| Scanned PDF | All formats |
| Image | All formats |
| Microsoft Word file | All formats |
| Microsoft PowerPoint file | All formats |
| HTML file | Markdown |
Key-value pair extraction is only supported for English language documents.
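The compatibility rules above can be checked before a job is submitted. The sketch below is a hypothetical client-side helper, not part of the text extraction API: the grouping of file extensions into input kinds and the lowercase format names are assumptions based on the supported file type lists.

```python
from pathlib import Path

ALL_FORMATS = {"json", "markdown", "html", "txt"}

# Assumed mapping from file extension to the input kinds in the table.
INPUT_KINDS = {
    "pdf": {".pdf"},
    "image": {".gif", ".jpg", ".png", ".tiff", ".bmp", ".jfif"},
    "word": {".doc", ".docx"},
    "powerpoint": {".ppt", ".pptx"},
    "html": {".html"},
}

# Output formats that each input kind can be extracted into.
COMPATIBLE_OUTPUTS = {
    "pdf": ALL_FORMATS,
    "image": ALL_FORMATS,
    "word": ALL_FORMATS,
    "powerpoint": ALL_FORMATS,
    "html": {"markdown"},
}

def input_kind(filename: str) -> str:
    """Classify a file by its extension, or raise if it is unsupported."""
    suffix = Path(filename).suffix.lower()
    for kind, suffixes in INPUT_KINDS.items():
        if suffix in suffixes:
            return kind
    raise ValueError(f"Unsupported input file type: {filename}")

def is_compatible(filename: str, output_format: str) -> bool:
    """Return True if the file can be extracted into the output format."""
    return output_format.lower() in COMPATIBLE_OUTPUTS[input_kind(filename)]

print(is_compatible("contract.pdf", "JSON"))   # True
print(is_compatible("page.html", "json"))      # False
print(is_compatible("page.html", "markdown"))  # True
```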
Ways to work
You can extract text from documents stored in your watsonx.ai project with these programmatic methods: