Introduction
Theextract method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, for use in workflows like data entry automation, analytics, or integration with databases and applications.
The typical extraction workflow follows these steps:
- Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
- Extraction: Use Retab’s
extractmethod to process the document and retrieve structured data. - Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
parse method that focuses on raw text extraction, extract provides:
- Structured Output: Data extracted directly into JSON format matching your schema.
- AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
- Modality Support: Works with text, images, or native document formats.
- Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
- Likelihood Scores: Provides confidence scores for each extracted field.
- Batch Processing Ready: Efficient for high-volume extraction tasks.
Extract API
A ParsedChatCompletion object with the extracted data, usage details, and confidence scores.
Use Case: Extracting Event Information from Documents
Extract structured calendar event data from a booking confirmation image and validate confidence scores before saving to a database.Use Case: Using Additional Messages for Context
Useadditional_messages to provide extra context or specific instructions that help guide the extraction. This is useful when you need to clarify ambiguous fields, provide domain-specific knowledge, or correct the model’s behavior.
Best Practices
Model Selection
gpt-4.1-nano: Balanced for accuracy and cost, recommended for most extraction tasks.gemini-2.5-pro: Use for complex documents requiring deep contextual understanding.gemini-2.5-flash: Faster and cheaper for simple extractions or high-volume processing.
Schema Design
- Keep schemas concise: Only include required fields to improve extraction accuracy.
- Use descriptive
descriptionfields: Helps the AI model understand what to extract. - Add
X-SystemPromptfor custom guidance: E.g., “Focus on freight details” for domain-specific extractions.
Confidence Handling
- Set a threshold (e.g., 0.7) for automated processing.
- For critical tasks, enable
n_consensus > 1to average results and boost reliability.
Modality Choice
- Text: For clean, text-heavy documents.
- Native: For PDFs or images with layouts preserved.
Using Additional Messages
- Use
additional_messagesto provide domain-specific context or clarifications. - Simulate a conversation: Add a user message with context, then an assistant acknowledgment to prime the model.
- Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
- Keep messages concise to avoid diluting the extraction focus.
- You can also use the
create_messagesAPI to convert additional documents into chat messages, then pass them asadditional_messagesfor multi-document context: