Introduction
The extract method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, for use in workflows like data entry automation, analytics, or integration with databases and applications.
The typical extraction workflow follows these steps:
- Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
- Extraction: Use Retab’s extract method to process the document and retrieve structured data.
- Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
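The three steps above can be sketched in Python as follows. This is a minimal illustration rather than Retab's SDK: the invoice schema, the `extracted` payload, and the `missing_required` helper are all hypothetical, and the extraction call itself is replaced by a literal because its exact signature is not shown in this section.

```python
# Step 1: a JSON schema describing the data to extract (hypothetical invoice fields).
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Identifier printed on the invoice"},
        "total_amount": {"type": "number", "description": "Grand total including tax"},
        "due_date": {"type": "string", "description": "Payment due date in ISO 8601 format"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Step 2: stand-in for what an extraction call might return for this schema.
# In practice this would come from the extract method, not a literal.
extracted = {"invoice_number": "INV-0042", "total_amount": 1280.5, "due_date": None}


def missing_required(schema: dict, data: dict) -> list[str]:
    """Step 3: list required fields that are absent or null in the extracted data."""
    return [name for name in schema.get("required", []) if data.get(name) is None]


problems = missing_required(invoice_schema, extracted)
print(problems or "all required fields present")
```

A check like this only covers presence of required fields; full JSON Schema validation (types, formats) would need a dedicated validator.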
Unlike the parse method, which focuses on raw text extraction, extract provides:
- Structured Output: Data extracted directly into JSON format matching your schema.
- AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
- Modality Support: Works with text, images, or native document formats.
- Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
- Likelihood Scores: Provides confidence scores for each extracted field.
- Batch Processing Ready: Efficient for high-volume extraction tasks.
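To make the consensus and likelihood ideas concrete, here is a conceptual sketch that combines several extraction runs by majority vote per field and uses the share of agreeing runs as a rough likelihood. How Retab actually reconciles runs and computes its scores is not described in this section, so the `consensus` function and the sample runs are hypothetical.

```python
from collections import Counter


def consensus(runs: list[dict]) -> tuple[dict, dict]:
    """Pick the most common value per field and report the fraction of runs that agree.

    Field values must be hashable (strings, numbers, None) for this sketch to work.
    """
    merged, likelihoods = {}, {}
    fields = {name for run in runs for name in run}
    for name in sorted(fields):
        votes = Counter(run.get(name) for run in runs)
        value, count = votes.most_common(1)[0]
        merged[name] = value
        likelihoods[name] = count / len(runs)
    return merged, likelihoods


# Three hypothetical runs over the same document; one disagrees on the vendor name.
runs = [
    {"vendor": "Acme Corp", "total": 120.0},
    {"vendor": "Acme Corp", "total": 120.0},
    {"vendor": "Acme Corp.", "total": 120.0},
]
merged, likelihoods = consensus(runs)
print(merged)
print(likelihoods)
```

Fields on which every run agrees get a score of 1.0, while the disputed vendor field scores lower, which is the kind of signal a downstream review step can act on.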
Extract API
The method returns a ParsedChatCompletion object containing the extracted data, usage details, and confidence scores.
Use Case: Extracting Event Information from Documents
Extract structured calendar event data from a booking confirmation image and validate confidence scores before saving to a database.
Best Practices
Model Selection
- gpt-4.1-nano: Balanced for accuracy and cost, recommended for most extraction tasks.
- gemini-2.5-pro: Use for complex documents requiring deep contextual understanding.
- gemini-2.5-flash: Faster and cheaper for simple extractions or high-volume processing.
Schema Design
- Keep schemas concise: Only include required fields to improve extraction accuracy.
- Use descriptive description fields: Helps the AI model understand what to extract.
- Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.
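As an illustration of these guidelines, the schema below keeps to three fields, gives each one a `description`, and adds an `X-SystemPrompt`. The field names are hypothetical, and placing `X-SystemPrompt` at the top level of the schema is an assumption; the schema reference should be checked for the exact location.

```python
import json

freight_schema = {
    "X-SystemPrompt": "Focus on freight details.",  # assumed to be a top-level key
    "type": "object",
    "properties": {
        "carrier_name": {
            "type": "string",
            "description": "Legal name of the carrier handling the shipment",
        },
        "gross_weight_kg": {
            "type": "number",
            "description": "Total shipment weight in kilograms, including packaging",
        },
        "delivery_date": {
            "type": "string",
            "description": "Planned delivery date in ISO 8601 format (YYYY-MM-DD)",
        },
    },
    "required": ["carrier_name", "gross_weight_kg"],
}

# Quick self-check: every property should carry a non-empty description.
undocumented = [
    name
    for name, spec in freight_schema["properties"].items()
    if not spec.get("description")
]
print("fields without description:", undocumented)
print(json.dumps(freight_schema, indent=2))
```

Descriptions that state units and formats, as here, tend to be more useful to the model than ones that merely restate the field name.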
Confidence Handling
- Set a threshold (e.g., 0.7) for automated processing.
- For critical tasks, enable n_consensus > 1 to average results and boost reliability.
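A threshold check like the one described above could look like the following sketch. It assumes the likelihood scores are available as a flat dictionary keyed by field name; that shape, the `route_fields` helper, and the sample event data are hypothetical.

```python
def route_fields(data: dict, likelihoods: dict, threshold: float = 0.7) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and needs-review groups."""
    accepted, review = {}, {}
    for name, value in data.items():
        score = likelihoods.get(name, 0.0)  # a missing score is treated as low confidence
        target = accepted if score >= threshold else review
        target[name] = value
    return accepted, review


# Hypothetical extraction result and per-field likelihoods for a booking confirmation.
data = {"event_name": "Team offsite", "start_date": "2025-03-14", "location": "Room 4B"}
likelihoods = {"event_name": 0.95, "start_date": 0.88, "location": 0.55}

accepted, review = route_fields(data, likelihoods)
print("accepted:", accepted)
print("needs review:", review)
```

Fields in the review group can be sent to a human or re-extracted with a higher consensus setting, while accepted fields proceed to automated processing.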
Modality Choice
- Text: For clean, text-heavy documents.
- Native: For PDFs or images with layouts preserved.