Introduction

The extract method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating extraction tasks such as pulling key information from invoices, forms, receipts, images, or scanned documents, whether for data entry, analytics, or integration with databases and applications. The typical extraction workflow follows these steps:
  1. Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
  2. Extraction: Use Retab’s extract method to process the document and retrieve structured data.
  3. Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
For advanced validation or post-processing, we recommend pairing extract with a schema validation library such as Pydantic (Python) or Zod (JavaScript) to ensure data integrity; a Pydantic-based sketch follows the list below. Unlike the parse method, which focuses on raw text extraction, extract provides:
  • Structured Output: Data extracted directly into JSON format matching your schema.
  • AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
  • Modality Support: Works with text, images, or native document formats.
  • Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
  • Likelihood Scores: Provides confidence scores for each extracted field.
  • Batch Processing Ready: Efficient for high-volume extraction tasks.
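
If you already maintain Pydantic models, one way to avoid hand-writing the JSON schema is to generate it from the model. A minimal sketch, assuming Pydantic v2 (X-SystemPrompt is a Retab extension rather than standard JSON Schema, so it is added separately):

from pydantic import BaseModel

class CalendarEvent(BaseModel):
    """Target structure for the extraction."""
    name: str
    date: str  # ISO 8601

# Pydantic v2's model_json_schema() emits a standard JSON Schema dict
# (title, type, properties, required) for the model.
schema = CalendarEvent.model_json_schema()

# X-SystemPrompt is a Retab-specific key, so merge it in on top.
schema['X-SystemPrompt'] = 'You are a helpful assistant.'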

Extract API

Parameters

ExtractRequest: the request body, covering the document, model, json_schema, modality, and n_consensus fields shown in the example below.

Returns

ParsedChatCompletion: an object containing the extracted data, usage details, and confidence scores.
from retab import Retab

client = Retab()

# doc_msg = client.documents.extractions.stream(...) for streaming mode
doc_msg = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a helpful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1  # 1 disables consensus (default); values greater than 1 run n-consensus mode
)
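
The returned ParsedChatCompletion follows the familiar chat-completion shape, so (as the use case below shows) the parsed fields and per-field confidence scores can be read straight off the result:

event = doc_msg.choices[0].message.parsed  # dict of extracted fields
scores = doc_msg.likelihoods               # per-field confidence scores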

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate confidence scores before saving to a database.
from retab import Retab
from pydantic import BaseModel, ValidationError

client = Retab()

# Define Pydantic model matching the schema for validation
class CalendarEvent(BaseModel):
    name: str
    date: str  # ISO 8601

# Extract data
result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a helpful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1
)

# Access extracted data (ParsedChatCompletion follows the chat-completion shape)
extracted_data = result.choices[0].message.parsed
likelihoods = result.likelihoods

# Validate with Pydantic
try:
    event = CalendarEvent(**extracted_data)
    print(f"Extracted Event: {event.name} on {event.date}")
    
    # Check confidence
    if all(score > 0.7 for score in likelihoods.values()):
        print("High confidence extraction - Saving to DB...")
        # db.save(event)  # Pseudo-code for DB integration
    else:
        print("Low confidence - Review manually")
except ValidationError as e:
    print(f"Validation failed: {e}")

print(f"Processed with {result.content.usage.total_tokens} tokens")

Best Practices

Model Selection

  • gpt-4.1-nano: A good balance of accuracy and cost; recommended for most extraction tasks.
  • gemini-2.5-pro: Use for complex documents requiring deep contextual understanding.
  • gemini-2.5-flash: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

  • Keep schemas concise: Only include required fields to improve extraction accuracy.
  • Use descriptive description fields: Helps the AI model understand what to extract.
  • Add X-SystemPrompt for custom guidance: e.g., “Focus on freight details” for domain-specific extractions (see the example below).
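
Putting these guidelines together, a concise, well-described schema for the freight example might look like the following (the field names are illustrative):

freight_schema = {
    'X-SystemPrompt': 'Focus on freight details.',  # domain-specific guidance
    'title': 'BookingConfirmation',
    'type': 'object',
    'properties': {
        'reference': {
            'title': 'Reference',
            'type': 'string',
            'description': 'The booking reference number printed on the confirmation.'
        },
        'pickup_date': {
            'title': 'Pickup Date',
            'type': 'string',
            'description': 'The scheduled pickup date in ISO 8601 format.'
        }
    },
    'required': ['reference', 'pickup_date']
}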

Confidence Handling

  • Set a threshold (e.g., 0.7) for automated processing.
  • For critical tasks, set n_consensus > 1 to combine multiple runs and boost reliability, as in the sketch below.
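
A minimal sketch combining both practices, reusing the client and schema from the earlier examples (the 0.7 threshold is illustrative and should be tuned per workload):

result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema=schema,  # defined as in the earlier examples
    n_consensus=3        # three runs, reconciled into a single result
)

THRESHOLD = 0.7  # illustrative cutoff for automated processing

low_confidence = {f: s for f, s in result.likelihoods.items() if s <= THRESHOLD}
if low_confidence:
    print(f"Flag for manual review: {low_confidence}")
else:
    print("All fields above threshold; safe to process automatically")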

Modality Choice

  • Text: For clean, text-heavy documents.
  • Native: For PDFs or images where preserving the layout matters (see the sketch below).
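
For example, a layout-sensitive document could be processed with the native modality, assuming the client and schema from the earlier examples (the PDF path is hypothetical):

result = client.documents.extract(
    document="freight/booking_confirmation.pdf",  # hypothetical PDF path
    model="gpt-4.1-nano",
    json_schema=schema,
    modality="native"  # preserve the document's layout during extraction
)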