Extract - Retab Docs

Introduction

The extractions.create method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, for use in workflows like data entry automation, analytics, or integration with databases and applications.

The typical extraction workflow follows these steps:

Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
Extraction: Call client.extractions.create(...) to process the document and retrieve structured data.
Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.

The SDKs already help here:

Python parses message.parsed against the JSON schema you pass in
Node accepts a JSON schema object, a schema file path, or a zod schema directly

You can still add your own application-level validation if you want stricter business rules.

Unlike the parse method, which focuses on raw text extraction, extract provides:

Structured Output: Data extracted directly into JSON format matching your schema.
AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
Modality Support: Works with text, images, or native document formats.
Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
Likelihood Scores: Provides likelihood scores for each extracted field.
Batch Processing Ready: Efficient for high-volume extraction tasks.
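The application-level validation mentioned above can stay small. The sketch below is a minimal, hypothetical example: `validate_event` is not part of the Retab SDK, and it assumes the extracted output is a plain dict with `name` and `date` keys, like the calendar example used on this page.

```python
from datetime import date


def validate_event(output: dict) -> list[str]:
    """Return a list of business-rule violations (an empty list means valid).

    Hypothetical helper, not part of the Retab SDK. Assumes `output` is the
    extracted dict with `name` and `date` keys.
    """
    problems = []

    name = output.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name must be a non-empty string")

    raw_date = output.get("date")
    try:
        # Raises TypeError for non-strings and ValueError for bad formats.
        date.fromisoformat(raw_date)
    except (TypeError, ValueError):
        problems.append("date must be an ISO 8601 calendar date (YYYY-MM-DD)")

    return problems


print(validate_event({"name": "Kickoff", "date": "2024-05-17"}))
print(validate_event({"name": "", "date": "17/05/2024"}))
```

Returning a list of problems rather than raising on the first one makes it easier to show every issue to a reviewer at once.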

Extract API

ExtractionRequest

document

string | object

required

The document to extract from. Can be a file path (string), or an object with filename (string) and url (string, e.g., base64-encoded data).

model

string

required

The AI model to use for extraction. Examples: retab-small for balanced accuracy and speed.

json_schema

object

required

The JSON schema defining the structure of the extracted data. Includes properties, required fields, and optional X-SystemPrompt for custom instructions.

image_resolution_dpi

integer

default:"192"

The DPI of the image sent to the LLM. Defaults to 192.

metadata

dict[str, str]

default:"{}"

User-defined metadata to associate with this extraction. Defaults to {}.

n_consensus

integer

default:"1"

Number of consensus runs. Set to >1 for multi-run averaging to improve accuracy on uncertain extractions (increases cost).

instructions

string | null

default:"null"

Free-form instructions appended to the system prompt to steer the extraction.

Returns

Extraction

An Extraction record with the extracted data, usage details, consensus likelihoods, and a persisted id.

id

string

Unique identifier for the stored extraction record.

output

object

The extracted structured data (parsed against your json_schema).

consensus

object

Consensus metadata. consensus.likelihoods holds per-field confidence scores (populated when n_consensus > 1).

usage

object

Token usage details.

file

object

Information about the extracted file.

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

extraction = client.extractions.create(
    document="freight/booking_confirmation.jpg",
    model="retab-micro",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    n_consensus=1,  # 1 means disabled (default); if > 1 the extraction runs in consensus mode
)

print(extraction.output)
print(extraction.consensus.likelihoods)
print(extraction.id)

Build Your Schema

Retab extraction quality depends heavily on schema quality. The fastest way to improve a schema is to run extraction with consensus enabled, inspect unstable fields, and tighten the schema until the low-confidence areas disappear.

Why consensus helps

Consensus is Retab’s practical schema debugging loop. Multiple extraction runs using the same schema reveal where the model disagrees. Those disagreements usually mean one of three things:

The field name is ambiguous
The field description is too loose
The field type or structure is asking the model to normalize too much at once

As a rule of thumb, fields below 0.75 likelihood need attention before production use.
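The rule of thumb above can be applied mechanically. The following is a minimal sketch, assuming the likelihoods arrive as a (possibly nested) dict of field names to scores; `weak_fields` is a hypothetical helper, not a Retab SDK function, and it ignores list-valued entries for brevity.

```python
def weak_fields(likelihoods, threshold=0.75, prefix=""):
    """Collect (path, score) pairs below `threshold`, lowest score first.

    Hypothetical helper: assumes `likelihoods` is a nested dict whose leaves
    are numeric scores. Other leaf types are skipped.
    """
    found = []
    for key, value in likelihoods.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects and keep the dotted path.
            found.extend(weak_fields(value, threshold, path))
        elif isinstance(value, (int, float)) and value < threshold:
            found.append((path, value))
    return sorted(found, key=lambda item: item[1])


scores = {"name": 1.0, "date": 0.5, "address": {"city": 0.9, "zip_code": 0.25}}
print(weak_fields(scores))
```

Sorting by score puts the fields that most need schema work at the top of the list.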

Schema improvement levers

Lever	When to apply	Concrete fix
Change field names	Models mix up concepts	Rename `name` to `event_name`
Improve descriptions	Correct concept, inconsistent format	Add format instructions and examples
Tighten field types	Strings, numbers, and dates drift across runs	Use stronger types and explicit ISO formats
Restructure nested data	One field combines multiple concepts	Replace `address: string` with nested address fields
Add reasoning prompts	Calculations or conversions are unstable	Add `X-ReasoningPrompt` to the field
Remove weak fields	The field is noisy and non-critical	Drop it or defer it to a later version

Example: iterating on a schema

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

initial_schema = {
    "title": "CalendarEvent",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {"type": "string"},
        "address": {"type": "string"},
    },
    "required": ["name", "date", "address"],
}

result = client.extractions.create(
    document="event_flyer.pdf",
    model="retab-small",
    json_schema=initial_schema,
    n_consensus=4,
)

print(result.output)
print(result.consensus.likelihoods if result.consensus else None)

Example interpretation:

name: 1.0 means the field is stable
date: 0.5 usually means the format is underspecified
address: 0.25 usually means the field should be decomposed into structured subfields
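One way to build intuition for these numbers is to read a likelihood as the share of runs that agree on a value. The sketch below is only an illustration under that assumption: `agreement_scores` is a hypothetical helper, the four run outputs are made-up sample data, and Retab's actual consensus scoring may work differently.

```python
from collections import Counter


def agreement_scores(runs):
    """For each field, return the share of runs that agree with the most common value.

    Illustrative only: this is NOT Retab's scoring algorithm.
    """
    scores = {}
    fields = {key for run in runs for key in run}
    for field in sorted(fields):
        # repr() makes unhashable values (lists, dicts) countable.
        values = [repr(run.get(field)) for run in runs]
        top_count = Counter(values).most_common(1)[0][1]
        scores[field] = top_count / len(runs)
    return scores


# Made-up outputs from four hypothetical runs over the same flyer.
runs = [
    {"name": "Spring Gala", "date": "2024-05-17", "address": "12 Main St, Lyon"},
    {"name": "Spring Gala", "date": "2024-05-17", "address": "12 Main Street"},
    {"name": "Spring Gala", "date": "17/05/2024", "address": "Lyon, France"},
    {"name": "Spring Gala", "date": "May 17", "address": "12 Main St, 69001 Lyon"},
]
print(agreement_scores(runs))
```

With this sample data the name is unanimous, the date splits across formats, and every run phrases the address differently, which mirrors the pattern described above.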

Common next step:

change date to an ISO date field
split address into street, city, zip_code, and country
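Applied to the initial schema, those two changes might look like the sketch below. The descriptions are illustrative wording, and using the standard JSON Schema `format: "date"` keyword is an assumption about how you choose to express an ISO date field; check the API Reference for which keywords Retab enforces.

```python
# Revised schema sketch: ISO date field plus a structured address.
revised_schema = {
    "title": "CalendarEvent",
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "The name of the calendar event.",
        },
        "date": {
            "type": "string",
            "format": "date",  # standard JSON Schema keyword (assumed to be honored)
            "description": "Event date in ISO 8601 format (YYYY-MM-DD).",
        },
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "zip_code": {"type": "string"},
                "country": {"type": "string"},
            },
            "required": ["street", "city", "zip_code", "country"],
        },
    },
    "required": ["name", "date", "address"],
}

print(sorted(revised_schema["properties"]["address"]["properties"]))
```

Re-running the same consensus extraction with the revised schema shows whether the date and address likelihoods actually improved.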

Best practices:

Start schema iteration with n_consensus=4 or 5
Fix the lowest-likelihood fields first
Test with diverse documents, not one golden sample
Treat consensus as a schema debugging tool, not just a scoring feature

Reasoning

Use reasoning when a field requires calculation, conversion, or multi-step logic. Retab supports this through the X-ReasoningPrompt JSON Schema annotation. The model then produces an auxiliary reasoning___<field_name> output during extraction, which gives it room to work through the logic before returning the final field value.

Example: Celsius to Fahrenheit

from pydantic import BaseModel, Field

class TemperatureReport(BaseModel):
    temperature: float = Field(
        ...,
        description="temperature in Fahrenheit",
        json_schema_extra={
            "X-ReasoningPrompt": (
                "If the temperature is given in Celsius, explicitly convert it "
                "to Fahrenheit. If it is already in Fahrenheit, leave it unchanged."
            )
        },
    )

print(TemperatureReport.model_json_schema())

Without reasoning, a model may copy 22.5 directly. With reasoning enabled, it is more likely to compute 72.5 and explain the conversion in reasoning___temperature.

Use reasoning for:

unit conversions
tax or total calculations
date normalization
derived fields that depend on multiple source values
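To sanity-check the numbers in the temperature example, the conversion can be worked out directly. This small function is just the standard Celsius-to-Fahrenheit formula; it is useful as a reference value when testing whether an extraction applied the reasoning prompt.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32


# 22.5 * 9 / 5 = 40.5, and 40.5 + 32 = 72.5
print(celsius_to_fahrenheit(22.5))  # prints 72.5
```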

Sources

Retab can also return provenance for extraction results. Given an extraction ID, sources returns the same output structure, but every scalar leaf becomes a { value, source } object pointing back to the original document location.

This is useful for:

review UIs with citations
provenance audits
highlighting extracted fields in viewers
debugging incorrect outputs

from retab import Retab

client = Retab()

sources = client.extractions.sources("extr_01G34H8J2K")

print(sources.extraction)
print(sources.sources)

Response shape

Every sourced leaf looks like this:

{
  "value": "INV-1032",
  "source": {
    "content": "INV-1032",
    "anchor": {
      "kind": "pdf_bbox",
      "page": 1,
      "left": 0.6,
      "top": 0.12,
      "width": 0.25,
      "height": 0.03
    }
  }
}

Common anchor types:

pdf_bbox: normalized bounding box on a PDF page
image_bbox: normalized bounding box on an image
csv_cell: row and column reference in CSV
spreadsheet_cell: sheet and cell reference in XLSX
docx_text_span or docx_table_cell: DOCX paragraph or table-cell location
text_span: line and character offsets in plain text
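For highlighting in a viewer, a normalized bounding box usually has to be scaled to the rendered page size. The sketch below assumes that `left`, `top`, `width`, and `height` are fractions of the page dimensions with the origin at the top-left corner; that interpretation and the `bbox_to_pixels` helper are assumptions, not documented Retab behavior.

```python
def bbox_to_pixels(anchor: dict, page_width: int, page_height: int) -> tuple:
    """Convert a normalized bbox anchor to (x0, y0, x1, y1) pixel coordinates.

    Assumes fractions of page size measured from the top-left corner.
    """
    x0 = anchor["left"] * page_width
    y0 = anchor["top"] * page_height
    x1 = x0 + anchor["width"] * page_width
    y1 = y0 + anchor["height"] * page_height
    return round(x0), round(y0), round(x1), round(y1)


anchor = {"kind": "pdf_bbox", "page": 1, "left": 0.6, "top": 0.12, "width": 0.25, "height": 0.03}
print(bbox_to_pixels(anchor, page_width=1000, page_height=2000))
```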

If a field cannot be sourced, Retab still returns the value and sets source to null.
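A review tool typically needs to walk the sourced structure and find the leaves that have no source. The following sketch assumes the sourced result has been converted to plain dicts and lists; `iter_sourced_leaves` is a hypothetical helper, and it would misread a schema object whose only keys happen to be `value` and `source`.

```python
def iter_sourced_leaves(node, path=""):
    """Yield (path, value, source) for every {value, source} leaf.

    Hypothetical helper: assumes plain dict/list data, not SDK objects.
    """
    if isinstance(node, dict):
        if set(node) == {"value", "source"}:
            yield path, node["value"], node["source"]
            return
        for key, child in node.items():
            yield from iter_sourced_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_sourced_leaves(child, f"{path}[{index}]")


# Made-up sample in the documented leaf shape.
sample = {
    "invoice_number": {
        "value": "INV-1032",
        "source": {"content": "INV-1032", "anchor": {"kind": "pdf_bbox", "page": 1}},
    },
    "lines": [{"sku": {"value": "A-1", "source": None}}],
}

unsourced = [path for path, _, source in iter_sourced_leaves(sample) if source is None]
print(unsourced)
```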

Managing Saved Extractions

Extractions are persisted resources. After calling client.extractions.create(...), you can retrieve them later, list them with pagination, and filter them by metadata or date range.

Listing extractions

Use client.extractions.list(...) when you need to browse recent extractions, paginate through large histories, or filter by metadata.

from datetime import datetime
from retab import Retab

client = Retab()

recent = client.extractions.list(
    limit=10,
    order="desc",
)

customer_specific = client.extractions.list(
    metadata={"customer_id": "cust_acme_corp"},
    limit=50,
)

date_filtered = client.extractions.list(
    from_date=datetime(2024, 1, 1),
    to_date=datetime(2024, 12, 31),
)

Useful list filters:

limit: page size
order: asc or desc by creation date
before and after: id pagination
from_date and to_date: date-range filters
metadata: exact-match metadata filtering
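Walking a long history with id pagination generally follows a cursor loop. The sketch below keeps the Retab call out of the loop on purpose: `iterate_all` and `fake_fetch` are hypothetical, and both the response shape and the assumption that `after` returns items following the given id should be checked against the API Reference before wiring in `client.extractions.list(after=..., limit=...)`.

```python
from typing import Callable, Iterator, Optional


def iterate_all(
    fetch_page: Callable[[Optional[str], int], list],
    page_size: int = 50,
) -> Iterator[dict]:
    """Yield every item by following an `after` cursor until a short page appears."""
    cursor = None
    while True:
        page = fetch_page(cursor, page_size)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        cursor = page[-1]["id"]


# In-memory stand-in for the real list call, for demonstration only.
FAKE_IDS = [f"extr_{n:03d}" for n in range(7)]


def fake_fetch(after, limit):
    start = FAKE_IDS.index(after) + 1 if after else 0
    return [{"id": item_id} for item_id in FAKE_IDS[start:start + limit]]


print([item["id"] for item in iterate_all(fake_fetch, page_size=3)])
```

Passing the fetch function in as a parameter keeps the loop testable without network access.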

Getting an extraction by ID

from retab import Retab

client = Retab()

extraction = client.extractions.get("extr_01G34H8J2K")
print(extraction)

Filtering by metadata

Metadata is useful when you want to organize extractions by tenant, workflow, batch, or any other business key you attached at creation time.

from retab import Retab

client = Retab()

customer_extractions = client.extractions.list(
    metadata={"customer_id": "cust_acme_corp"},
    limit=100,
)

For full request and response details, see the extraction endpoints in the API Reference.

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate likelihood scores before saving to a database.

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

# Extract data.
result = client.extractions.create(
    document="freight/booking_confirmation.jpg",
    model="retab-micro",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string',
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string',
            },
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object',
    },
    n_consensus=1,
)

extracted_data = result.output
likelihoods = (result.consensus.likelihoods or {}) if result.consensus else {}

print(f"Extracted Event: {extracted_data.get('name')} on {extracted_data.get('date')}")

if likelihoods and all(score > 0.7 for score in likelihoods.values()):
    print("High likelihood extraction - Saving to DB...")
else:
    print("Low likelihood - Review manually")

if result.usage:
    print(f"Processed using {result.usage.credits} credits")

Use Case: Passing Extra Instructions

Use instructions to provide a free-form note that guides the extraction. This is useful for passing context from an iterative loop, tenant-specific conventions, or any hint that the schema itself does not capture.

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

result = client.extractions.create(
    document="invoices/invoice_001.pdf",
    model="retab-micro",
    json_schema={
        'properties': {
            'vendor_name': {'type': 'string', 'description': 'Name of the vendor'},
            'invoice_number': {'type': 'string', 'description': 'Invoice number'},
            'total_amount': {'type': 'number', 'description': 'Total amount due'},
            'currency': {'type': 'string', 'description': 'Currency code (e.g., USD, EUR)'}
        },
        'required': ['vendor_name', 'invoice_number', 'total_amount'],
        'type': 'object'
    },
    instructions=(
        "This invoice is from our European supplier. "
        "Amounts should be in EUR unless explicitly stated otherwise. "
        "Extract values exactly as written; do not infer a missing currency."
    ),
)

print(result.output)

Metadata Filters

Use metadata at extraction time to tag each document with stable identifiers, then filter those extractions later using the same keys. This is especially important in multi-tenant systems: always include stable tenant-scoping metadata keys to avoid cross-tenant contamination.
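One way to make tenant scoping hard to forget is to build the metadata through a single helper. This is a minimal sketch: `build_metadata` is hypothetical, it reuses the `customer_id` key from the examples on this page as the tenant key, and it stringifies values because the metadata parameter is typed as dict[str, str].

```python
def build_metadata(tenant_id: str, **extra: object) -> dict:
    """Return string-only metadata that always carries the tenant key.

    Hypothetical helper, not part of the Retab SDK.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required for tenant-scoped extractions")
    metadata = {"customer_id": str(tenant_id)}
    for key, value in extra.items():
        if key == "customer_id":
            raise ValueError("customer_id is set from tenant_id and cannot be overridden")
        metadata[key] = str(value)  # metadata values must be strings
    return metadata


print(build_metadata("cust_123", source="sinari_jobs", batch=7))
```

The same helper can produce the filter dict for later list calls, so tagging and filtering always use identical keys.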

1. Tag the extraction request with metadata

from retab import Retab
from retab.types.extractions import ExtractionRequest

client = Retab()

# my_schema is assumed to be a JSON schema dict defined earlier in your code.
result = client.extractions.create(
    document="freight/booking_confirmation.jpg",
    model="retab-small",
    json_schema=my_schema,
    metadata={
        "customer_id": "cust_123",
        "source": "sinari_jobs",
        "project_id": "project_abc",
    },
)

2. Filter extractions by metadata

from retab import Retab

client = Retab()

customer_extractions = client.extractions.list(
    limit=50,
    metadata={
        "customer_id": "cust_123",
        "source": "sinari_jobs",
    },
)

When multiple metadata keys are provided, all keys are applied together as exact-match filters.
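The combined exact-match behavior described above can be pictured with a local equivalent. This sketch only illustrates the semantics on in-memory dicts; the real filtering happens server-side, and `matches_all` and the sample records are hypothetical.

```python
def matches_all(metadata: dict, filters: dict) -> bool:
    """True when every filter key is present with exactly the same value."""
    return all(metadata.get(key) == value for key, value in filters.items())


# Made-up records standing in for stored extraction metadata.
records = [
    {"customer_id": "cust_123", "source": "sinari_jobs", "project_id": "project_abc"},
    {"customer_id": "cust_123", "source": "email"},
    {"customer_id": "cust_999", "source": "sinari_jobs"},
]
filters = {"customer_id": "cust_123", "source": "sinari_jobs"}

print([record for record in records if matches_all(record, filters)])
```

Only the first record passes, because the other two each fail one of the two keys.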

Best Practices

Model Selection

retab-large: Use for complex documents requiring deep contextual understanding.
retab-small: Balanced for accuracy and cost, recommended for most extraction tasks.
retab-micro: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

Keep schemas concise: Include only the fields you actually need; fewer fields generally improve extraction accuracy.
Use descriptive description fields: Helps the AI model understand what to extract.
Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.

Confidence Handling

Set a threshold (e.g., 0.7) for automated processing.
For critical tasks, set n_consensus above 1 so results are reconciled across multiple runs, which improves reliability at a higher cost.

Using Instructions

Use instructions to pass domain-specific hints that aren’t part of the schema.
Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
Keep the string concise to avoid diluting the extraction focus.

from retab import Retab

client = Retab()

# my_schema is assumed to be a JSON schema dict defined earlier in your code.
result = client.extractions.create(
    document="invoices/invoice_001.pdf",
    json_schema=my_schema,
    model="retab-small",
    instructions="This invoice is from our European supplier. Amounts should be in EUR unless explicitly stated otherwise.",
)

