Introduction

The extract method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating extraction tasks such as pulling key information from invoices, forms, receipts, images, or scanned documents, whether for data entry, analytics, or integration with databases and applications. The typical extraction workflow follows these steps:
  1. Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
  2. Extraction: Use Retab’s extract method to process the document and retrieve structured data.
  3. Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
For advanced validation or post-processing, we recommend pairing extract with a schema validation library such as Pydantic (Python) or Zod (JavaScript) to ensure data integrity; a Pydantic-based sketch follows the list below. Unlike the parse method, which focuses on raw text extraction, extract provides:
  • Structured Output: Data extracted directly into JSON format matching your schema.
  • AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
  • Modality Support: Works with text, images, or native document formats.
  • Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
  • Likelihood Scores: Provides confidence scores for each extracted field.
  • Batch Processing Ready: Efficient for high-volume extraction tasks.
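
If you already maintain Pydantic models, one way to avoid hand-writing the JSON schema is to generate it from the model. A minimal sketch, assuming Pydantic v2 (X-SystemPrompt is a Retab extension rather than standard JSON Schema, so it is added separately):

from pydantic import BaseModel

class CalendarEvent(BaseModel):
    """Target structure for the extraction."""
    name: str
    date: str  # ISO 8601

# Pydantic v2's model_json_schema() emits a standard JSON Schema dict
# (title, type, properties, required) for the model.
schema = CalendarEvent.model_json_schema()

# X-SystemPrompt is a Retab-specific key, so merge it in on top.
schema['X-SystemPrompt'] = 'You are a helpful assistant.'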

Extract API

Parameters

ExtractRequest: the request body, covering the document, model, json_schema, modality, and n_consensus fields shown in the example below.

Returns

ParsedChatCompletion: an object containing the extracted data, usage details, and confidence scores.
from retab import Retab

client = Retab()

# doc_msg = client.documents.extractions.stream(...) for streaming mode
doc_msg = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a helpful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1  # 1 disables consensus (default); values greater than 1 run n-consensus mode
)
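
The returned ParsedChatCompletion follows the familiar chat-completion shape, so (as the use case below shows) the parsed fields and per-field confidence scores can be read straight off the result:

event = doc_msg.choices[0].message.parsed  # dict of extracted fields
scores = doc_msg.likelihoods               # per-field confidence scores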

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate confidence scores before saving to a database.
from retab import Retab
from pydantic import BaseModel, ValidationError

client = Retab()

# Define Pydantic model matching the schema for validation
class CalendarEvent(BaseModel):
    name: str
    date: str  # ISO 8601

# Extract data
result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a helpful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1
)

# Access extracted data (ParsedChatCompletion follows the chat-completion shape)
extracted_data = result.choices[0].message.parsed
likelihoods = result.likelihoods

# Validate with Pydantic
try:
    event = CalendarEvent(**extracted_data)
    print(f"Extracted Event: {event.name} on {event.date}")
    
    # Check confidence
    if all(score > 0.7 for score in likelihoods.values()):
        print("High confidence extraction - Saving to DB...")
        # db.save(event)  # Pseudo-code for DB integration
    else:
        print("Low confidence - Review manually")
except ValidationError as e:
    print(f"Validation failed: {e}")

print(f"Processed with {result.content.usage.total_tokens} tokens")

Best Practices

Model Selection

  • gpt-4.1-nano: A good balance of accuracy and cost; recommended for most extraction tasks.
  • gemini-2.5-pro: Use for complex documents requiring deep contextual understanding.
  • gemini-2.5-flash: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

  • Keep schemas concise: Only include required fields to improve extraction accuracy.
  • Use descriptive description fields: Helps the AI model understand what to extract.
  • Add X-SystemPrompt for custom guidance: e.g., “Focus on freight details” for domain-specific extractions (see the example below).
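
Putting these guidelines together, a concise, well-described schema for the freight example might look like the following (the field names are illustrative):

freight_schema = {
    'X-SystemPrompt': 'Focus on freight details.',  # domain-specific guidance
    'title': 'BookingConfirmation',
    'type': 'object',
    'properties': {
        'reference': {
            'title': 'Reference',
            'type': 'string',
            'description': 'The booking reference number printed on the confirmation.'
        },
        'pickup_date': {
            'title': 'Pickup Date',
            'type': 'string',
            'description': 'The scheduled pickup date in ISO 8601 format.'
        }
    },
    'required': ['reference', 'pickup_date']
}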

Confidence Handling

  • Set a threshold (e.g., 0.7) for automated processing.
  • For critical tasks, set n_consensus > 1 to combine multiple runs and boost reliability, as in the sketch below.
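
A minimal sketch combining both practices, reusing the client and schema from the earlier examples (the 0.7 threshold is illustrative and should be tuned per workload):

result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema=schema,  # defined as in the earlier examples
    n_consensus=3        # three runs, reconciled into a single result
)

THRESHOLD = 0.7  # illustrative cutoff for automated processing

low_confidence = {f: s for f, s in result.likelihoods.items() if s <= THRESHOLD}
if low_confidence:
    print(f"Flag for manual review: {low_confidence}")
else:
    print("All fields above threshold; safe to process automatically")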

Modality Choice

  • Text: For clean, text-heavy documents.
  • Native: For PDFs or images where preserving the layout matters (see the sketch below).
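
For example, a layout-sensitive document could be processed with the native modality, assuming the client and schema from the earlier examples (the PDF path is hypothetical):

result = client.documents.extract(
    document="freight/booking_confirmation.pdf",  # hypothetical PDF path
    model="gpt-4.1-nano",
    json_schema=schema,
    modality="native"  # preserve the document's layout during extraction
)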