Introduction

The extractions.create method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, for use in workflows like data entry automation, analytics, or integration with databases and applications. The typical extraction workflow follows these steps:
  1. Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
  2. Extraction: Call client.extractions.create(...) to process the document and retrieve structured data.
  3. Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
The SDKs already help here:
  • Python parses message.parsed against the JSON schema you pass in
  • Node accepts a JSON schema object, a schema file path, or a zod schema directly
You can still add your own application-level validation if you want stricter business rules.
Unlike the parse method, which focuses on raw text extraction, extract provides:
  • Structured Output: Data extracted directly into JSON format matching your schema.
  • AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
  • Modality Support: Works with text, images, or native document formats.
  • Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
  • Likelihood Scores: Provides likelihood scores for each extracted field.
  • Batch Processing Ready: Efficient for high-volume extraction tasks.

Extract API

Request body: ExtractionRequest
Returns: Extraction, a persisted record with the extracted data, usage details, consensus likelihoods, and an id.
from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

extraction = client.extractions.create(
    ExtractionRequest(
        document="freight/booking_confirmation.jpg",
        model="retab-micro",
        json_schema={
            'X-SystemPrompt': 'You are a useful assistant.',
            'properties': {
                'name': {
                    'description': 'The name of the calendar event.',
                    'title': 'Name',
                    'type': 'string'
                },
                'date': {
                    'description': 'The date of the calendar event in ISO 8601 format.',
                    'title': 'Date',
                    'type': 'string'
                }
            },
            'required': ['name', 'date'],
            'title': 'CalendarEvent',
            'type': 'object'
        },
        n_consensus=1,  # 1 means disabled (default); if > 1 the extraction runs in consensus mode
    )
)

print(extraction.output)
print(extraction.consensus.likelihoods)
print(extraction.id)

Build Your Schema

Retab extraction quality depends heavily on schema quality. The fastest way to improve a schema is to run extraction with consensus enabled, inspect unstable fields, and tighten the schema until the low-confidence areas disappear.

Why consensus helps

Consensus is Retab’s practical schema debugging loop. Multiple extraction runs using the same schema reveal where the model disagrees. Those disagreements usually mean one of three things:
  • The field name is ambiguous
  • The field description is too loose
  • The field type or structure is asking the model to normalize too much at once
As a rule of thumb, fields below 0.75 likelihood need attention before production use.
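That rule of thumb is easy to automate. The helper below is not part of the Retab SDK; it is a sketch that assumes the consensus likelihoods arrive as a flat field-to-score dict, and returns the fields to fix first:

```python
# Hypothetical helper (not part of the Retab SDK): flag schema fields whose
# consensus likelihood falls below a threshold, lowest likelihood first.
def unstable_fields(likelihoods: dict[str, float], threshold: float = 0.75) -> list[str]:
    """Return the names of fields scoring below the threshold, worst first."""
    flagged = [(field, score) for field, score in likelihoods.items() if score < threshold]
    return [field for field, _ in sorted(flagged, key=lambda pair: pair[1])]

print(unstable_fields({"name": 1.0, "date": 0.5, "address": 0.25}))
# ['address', 'date']
```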

Schema improvement levers

Lever                   | When to apply                                 | Concrete fix
Change field names      | Models mix up concepts                        | Rename name to event_name
Improve descriptions    | Correct concept, inconsistent format          | Add format instructions and examples
Tighten field types     | Strings, numbers, and dates drift across runs | Use stronger types and explicit ISO formats
Restructure nested data | One field combines multiple concepts          | Replace address: string with nested address fields
Add reasoning prompts   | Calculations or conversions are unstable      | Add X-ReasoningPrompt to the field
Remove weak fields      | The field is noisy and non-critical           | Drop it or defer it to a later version

Example: iterating on a schema

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

initial_schema = {
    "title": "CalendarEvent",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {"type": "string"},
        "address": {"type": "string"},
    },
    "required": ["name", "date", "address"],
}

result = client.extractions.create(
    ExtractionRequest(
        document="event_flyer.pdf",
        model="retab-small",
        json_schema=initial_schema,
        n_consensus=4,
    )
)

print(result.output)
print(result.consensus.likelihoods if result.consensus else None)
Example interpretation:
  • name: 1.0 means the field is stable
  • date: 0.5 usually means the format is underspecified
  • address: 0.25 usually means the field should be decomposed into structured subfields
Common next step:
  • change date to an ISO date field
  • split address into street, city, zip_code, and country
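Applying both fixes, the revised schema might look like the sketch below. The date format note and the address subfields follow the steps above; the description strings are illustrative:

```python
# Revised schema after one consensus iteration: an explicit ISO date format
# and a decomposed address object instead of a single free-text string.
revised_schema = {
    "title": "CalendarEvent",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The name of the calendar event."},
        "date": {
            "type": "string",
            "description": "The date of the event in ISO 8601 format (YYYY-MM-DD).",
        },
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "zip_code": {"type": "string"},
                "country": {"type": "string"},
            },
            "required": ["street", "city"],
        },
    },
    "required": ["name", "date", "address"],
}

print(sorted(revised_schema["properties"]["address"]["properties"]))
# ['city', 'country', 'street', 'zip_code']
```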
Best practices:
  • Start schema iteration with n_consensus=4 or 5
  • Fix the lowest-likelihood fields first
  • Test with diverse documents, not one golden sample
  • Treat consensus as a schema debugging tool, not just a scoring feature

Reasoning

Use reasoning when a field requires calculation, conversion, or multi-step logic. Retab supports this through the X-ReasoningPrompt JSON Schema annotation. The model then produces an auxiliary reasoning___<field_name> output during extraction, which gives it room to work through the logic before returning the final field value.

Example: Celsius to Fahrenheit

from pydantic import BaseModel, Field

class TemperatureReport(BaseModel):
    temperature: float = Field(
        ...,
        description="temperature in Fahrenheit",
        json_schema_extra={
            "X-ReasoningPrompt": (
                "If the temperature is given in Celsius, explicitly convert it "
                "to Fahrenheit. If it is already in Fahrenheit, leave it unchanged."
            )
        },
    )

print(TemperatureReport.model_json_schema())
Without reasoning, a model may copy 22.5 directly. With reasoning enabled, it is more likely to compute 72.5 and explain the conversion in reasoning___temperature. Use reasoning for:
  • unit conversions
  • tax or total calculations
  • date normalization
  • derived fields that depend on multiple source values
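Because X-ReasoningPrompt is a JSON Schema annotation, you do not need pydantic to use it. The same temperature field can be written as a plain schema dict (a sketch equivalent to the pydantic model above):

```python
# The TemperatureReport schema written as a plain JSON schema dict,
# with the reasoning prompt attached directly to the field.
temperature_schema = {
    "title": "TemperatureReport",
    "type": "object",
    "properties": {
        "temperature": {
            "type": "number",
            "description": "temperature in Fahrenheit",
            "X-ReasoningPrompt": (
                "If the temperature is given in Celsius, explicitly convert it "
                "to Fahrenheit. If it is already in Fahrenheit, leave it unchanged."
            ),
        }
    },
    "required": ["temperature"],
}

print("X-ReasoningPrompt" in temperature_schema["properties"]["temperature"])
# True
```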

Sources

Retab can also return provenance for extraction results. Given an extraction ID, sources returns the same output structure, but every scalar leaf becomes a { value, source } object pointing back to the original document location. This is useful for:
  • review UIs with citations
  • provenance audits
  • highlighting extracted fields in viewers
  • debugging incorrect outputs
from retab import Retab

client = Retab()

sources = client.extractions.sources("extr_01G34H8J2K")

print(sources.extraction)
print(sources.sources)

Response shape

Every sourced leaf looks like this:
{
  "value": "INV-1032",
  "source": {
    "content": "INV-1032",
    "anchor": {
      "kind": "pdf_bbox",
      "page": 1,
      "left": 0.60,
      "top": 0.12,
      "width": 0.25,
      "height": 0.03
    }
  }
}
Common anchor types:
  • pdf_bbox: normalized bounding box on a PDF page
  • image_bbox: normalized bounding box on an image
  • csv_cell: row and column reference in CSV
  • spreadsheet_cell: sheet and cell reference in XLSX
  • docx_text_span or docx_table_cell: DOCX paragraph or table-cell location
  • text_span: line and character offsets in plain text
If a field cannot be sourced, Retab still returns the value and sets source to null.
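For a review UI, you typically want every sourced leaf as a flat list. The walker below is not SDK code; it assumes the sources payload is a plain nested dict/list structure whose leaves are { value, source } objects as shown above:

```python
def iter_sourced_leaves(node, path=""):
    """Yield (path, value, anchor) for every { value, source } leaf in a
    nested dict/list structure shaped like a sources response."""
    if isinstance(node, dict):
        if set(node) == {"value", "source"}:
            source = node["source"] or {}  # source may be null for unsourced fields
            yield path, node["value"], source.get("anchor")
            return
        for key, child in node.items():
            yield from iter_sourced_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from iter_sourced_leaves(child, f"{path}[{i}]")

# Hypothetical payload for illustration:
payload = {
    "invoice_number": {
        "value": "INV-1032",
        "source": {"content": "INV-1032", "anchor": {"kind": "pdf_bbox", "page": 1}},
    },
    "notes": {"value": "n/a", "source": None},
}
for path, value, anchor in iter_sourced_leaves(payload):
    print(path, value, anchor)
```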

Managing Saved Extractions

Extractions are persisted resources. After calling client.extractions.create(...), you can retrieve them later, list them with pagination, and filter them by metadata or date range.

Listing extractions

Use client.extractions.list(...) when you need to browse recent extractions, paginate through large histories, or filter by metadata.
from datetime import datetime
from retab import Retab

client = Retab()

recent = client.extractions.list(
    limit=10,
    order="desc",
)

org_specific = client.extractions.list(
    metadata={"organization_id": "org_acme_corp"},
    limit=50,
)

date_filtered = client.extractions.list(
    from_date=datetime(2024, 1, 1),
    to_date=datetime(2024, 12, 31),
)
Useful list filters:
  • limit: page size
  • order: asc or desc by creation date
  • before and after: cursor pagination
  • from_date and to_date: date-range filters
  • metadata: exact-match metadata filtering

Getting an extraction by ID

from retab import Retab

client = Retab()

extraction = client.extractions.get("extr_01G34H8J2K")
print(extraction)

Filtering by metadata

Metadata is useful when you want to organize extractions by tenant, workflow, batch, or any other business key you attached at creation time.
from retab import Retab

client = Retab()

org_extractions = client.extractions.list(
    metadata={"organization_id": "org_acme_corp"},
    limit=100,
)
For full request and response details, see the extraction endpoints in the API Reference.

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate likelihood scores before saving to a database.
from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

# Extract data
result = client.extractions.create(
    ExtractionRequest(
        document="freight/booking_confirmation.jpg",
        model="retab-micro",
        json_schema={
            'X-SystemPrompt': 'You are a useful assistant.',
            'properties': {
                'name': {
                    'description': 'The name of the calendar event.',
                    'title': 'Name',
                    'type': 'string'
                },
                'date': {
                    'description': 'The date of the calendar event in ISO 8601 format.',
                    'title': 'Date',
                    'type': 'string'
                }
            },
            'required': ['name', 'date'],
            'title': 'CalendarEvent',
            'type': 'object'
        },
        n_consensus=1,
    )
)

extracted_data = result.output
likelihoods = (result.consensus.likelihoods or {}) if result.consensus else {}

print(f"Extracted Event: {extracted_data.get('name')} on {extracted_data.get('date')}")

if likelihoods and all(score > 0.7 for score in likelihoods.values()):
    print("High likelihood extraction - Saving to DB...")
else:
    print("Low likelihood - Review manually")

if result.usage:
    print(f"Processed with {result.usage.total_tokens} tokens")

Use Case: Passing Extra Instructions

Use instructions to provide a freeform note that guides the extraction — useful for iteration context from a loop, tenant-specific conventions, or any hint that isn’t captured by the schema itself.
from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

result = client.extractions.create(
    ExtractionRequest(
        document="invoices/invoice_001.pdf",
        model="retab-micro",
        json_schema={
            'properties': {
                'vendor_name': {'type': 'string', 'description': 'Name of the vendor'},
                'invoice_number': {'type': 'string', 'description': 'Invoice number'},
                'total_amount': {'type': 'number', 'description': 'Total amount due'},
                'currency': {'type': 'string', 'description': 'Currency code (e.g., USD, EUR)'}
            },
            'required': ['vendor_name', 'invoice_number', 'total_amount'],
            'type': 'object'
        },
        instructions=(
            "This invoice is from our European supplier. "
            "Amounts should be in EUR unless explicitly stated otherwise. "
            "Extract values exactly as written; do not infer a missing currency."
        ),
    )
)

print(result.output)

Metadata Filters

Use metadata at extraction time to tag each document with stable identifiers, then filter those extractions later using the same keys. This is especially important in multi-tenant systems: always include stable tenant-scoping metadata keys to avoid cross-tenant contamination.

1. Tag the extraction request with metadata

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

result = client.extractions.create(
    ExtractionRequest(
        document="freight/booking_confirmation.jpg",
        model="retab-small",
        json_schema=my_schema,
        metadata={
            "organization_id": "org_123",
            "source": "sinari_jobs",
            "project_id": "project_abc",
        },
    )
)

2. Filter extractions by metadata

from retab import Retab

client = Retab()

tenant_extractions = client.extractions.list(
    limit=50,
    metadata={
        "organization_id": "org_123",
        "source": "sinari_jobs",
    },
)
When multiple metadata keys are provided, all keys are applied together as exact-match filters.
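Those semantics (every key must match exactly, combined with AND) can be mirrored locally, for example when filtering cached results. A sketch, not SDK code:

```python
def metadata_matches(item_metadata: dict, filters: dict) -> bool:
    """True only if every filter key is present with an exactly equal value."""
    return all(item_metadata.get(key) == value for key, value in filters.items())

records = [
    {"organization_id": "org_123", "source": "sinari_jobs"},
    {"organization_id": "org_123", "source": "other"},
]
filters = {"organization_id": "org_123", "source": "sinari_jobs"}
print([metadata_matches(m, filters) for m in records])
# [True, False]
```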

Best Practices

Model Selection

  • retab-large: Use for complex documents requiring deep contextual understanding.
  • retab-small: Balanced for accuracy and cost, recommended for most extraction tasks.
  • retab-micro: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

  • Keep schemas concise: Include only the fields you actually need; extra fields dilute extraction accuracy.
  • Use descriptive description fields: Helps the AI model understand what to extract.
  • Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.

Confidence Handling

  • Set a threshold (e.g., 0.7) for automated processing.
  • For critical tasks, enable n_consensus > 1 to average results and boost reliability.

Using Instructions

  • Use instructions to pass domain-specific hints that aren’t part of the schema.
  • Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
  • Keep the string concise to avoid diluting the extraction focus.
from retab.types.extractions import ExtractionRequest

result = client.extractions.create(
    ExtractionRequest(
        document="invoices/invoice_001.pdf",
        json_schema=my_schema,
        model="retab-small",
        instructions="This invoice is from our European supplier. Amounts should be in EUR unless explicitly stated otherwise.",
    )
)