Introduction
The extractions.create method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. This endpoint is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, for use in workflows like data entry automation, analytics, or integration with databases and applications.
The typical extraction workflow follows these steps:
- Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
- Extraction: Call client.extractions.create(...) to process the document and retrieve structured data.
- Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
- Python parses message.parsed against the JSON schema you pass in.
- Node accepts a JSON schema object, a schema file path, or a zod schema directly.
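The three workflow steps above can be sketched end to end. This is an illustrative sketch only: the client and method names follow this page (`client.extractions.create`), while the invoice schema, field names, and document path are hypothetical.

```python
# Illustrative end-to-end sketch of the extraction workflow.
from typing import Any

# Step 1 -- Schema definition: describe the structure you want back.
invoice_schema: dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

def missing_required(parsed: dict, schema: dict) -> list[str]:
    """Step 3 -- Validation: list required fields absent from the parsed output."""
    return [f for f in schema.get("required", []) if f not in parsed]

# Step 2 -- Extraction (needs a Retab API key, so it is sketched, not run).
# Per this page, the response is an Extraction record with the extracted
# data, usage details, consensus likelihoods, and a persisted id:
#
# client = Retab()
# extraction = client.extractions.create(
#     document="invoice.pdf",
#     json_schema=invoice_schema,
#     model="retab-small",
# )
```

The validation helper is purely local, so it can run before any data is written downstream.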
Unlike the parse method, which focuses on raw text extraction, extract provides:
- Structured Output: Data extracted directly into JSON format matching your schema.
- AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
- Modality Support: Works with text, images, or native document formats.
- Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
- Likelihood Scores: Provides likelihood scores for each extracted field.
- Batch Processing Ready: Efficient for high-volume extraction tasks.
Extract API
Returns an Extraction record with the extracted data, usage details, consensus likelihoods, and a persisted id.

Build Your Schema
Retab extraction quality depends heavily on schema quality. The fastest way to improve a schema is to run extraction with consensus enabled, inspect unstable fields, and tighten the schema until the low-confidence areas disappear.

Why consensus helps
Consensus is Retab’s practical schema-debugging loop. Multiple extraction runs using the same schema reveal where the model disagrees. Those disagreements usually mean one of three things:
- The field name is ambiguous
- The field description is too loose
- The field type or structure is asking the model to normalize too much at once
As a rule of thumb, fields below a 0.75 likelihood need attention before production use.
Schema improvement levers
| Lever | When to apply | Concrete fix |
|---|---|---|
| Change field names | Models mix up concepts | Rename name to event_name |
| Improve descriptions | Correct concept, inconsistent format | Add format instructions and examples |
| Tighten field types | Strings, numbers, and dates drift across runs | Use stronger types and explicit ISO formats |
| Restructure nested data | One field combines multiple concepts | Replace address: string with nested address fields |
| Add reasoning prompts | Calculations or conversions are unstable | Add X-ReasoningPrompt to the field |
| Remove weak fields | The field is noisy and non-critical | Drop it or defer it to a later version |
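The debugging loop described above reduces to a small triage helper. This is a hypothetical local function (not part of the SDK): given per-field likelihood scores from a consensus run, it surfaces the fields to fix first.

```python
def unstable_fields(likelihoods: dict[str, float],
                    threshold: float = 0.75) -> list[str]:
    """Return field names whose consensus likelihood falls below `threshold`,
    least stable first, so the weakest schema fields get fixed first."""
    flagged = [f for f, p in likelihoods.items() if p < threshold]
    return sorted(flagged, key=lambda f: likelihoods[f])
```

Feeding in the scores from the example below (`name: 1.0`, `date: 0.5`, `address: 0.25`) would flag `address` first, then `date`.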
Example: iterating on a schema
- name: 1.0 means the field is stable
- date: 0.5 usually means the format is underspecified
- address: 0.25 usually means the field should be decomposed into structured subfields
- change date to an ISO date field
- split address into street, city, zip_code, and country
- Start schema iteration with n_consensus=4 or 5
- Fix the lowest-likelihood fields first
- Test with diverse documents, not one golden sample
- Treat consensus as a schema debugging tool, not just a scoring feature
Reasoning
Use reasoning when a field requires calculation, conversion, or multi-step logic. Retab supports this through the X-ReasoningPrompt JSON Schema annotation.
The model then produces an auxiliary reasoning___<field_name> output during extraction, which gives it room to work through the logic before returning the final field value.
Example: Celsius to Fahrenheit
Asked for a Fahrenheit value from a document that states 22.5 °C, the model may echo 22.5 directly. With reasoning enabled, it is more likely to compute 72.5 and explain the conversion in reasoning___temperature.
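A field definition for this case might look like the following sketch (the field name, description text, and prompt wording are illustrative; the `X-ReasoningPrompt` key is the annotation this page documents):

```python
# Hypothetical field definition: the X-ReasoningPrompt annotation gives the
# model room to show the conversion before committing to a value.
temperature_field = {
    "type": "number",
    "description": "Temperature in degrees Fahrenheit.",
    "X-ReasoningPrompt": (
        "If the document states the temperature in Celsius, convert it "
        "using F = C * 9/5 + 32 and return only the Fahrenheit number."
    ),
}

def celsius_to_fahrenheit(c: float) -> float:
    """The arithmetic the reasoning step should reproduce: 22.5 C -> 72.5 F."""
    return c * 9 / 5 + 32
```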
Use reasoning for:
- unit conversions
- tax or total calculations
- date normalization
- derived fields that depend on multiple source values
Sources
Retab can also return provenance for extraction results. Given an extraction ID, sources returns the same output structure, but every scalar leaf becomes a { value, source } object pointing back to the original document location.
This is useful for:
- review UIs with citations
- provenance audits
- highlighting extracted fields in viewers
- debugging incorrect outputs
Response shape
Every sourced leaf pairs a value with a source locator whose type depends on the document format:
- pdf_bbox: normalized bounding box on a PDF page
- image_bbox: normalized bounding box on an image
- csv_cell: row and column reference in CSV
- spreadsheet_cell: sheet and cell reference in XLSX
- docx_text_span or docx_table_cell: DOCX paragraph or table-cell location
- text_span: line and character offsets in plain text
When a value cannot be traced back to a document location, source is set to null.
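Because every scalar leaf becomes a `{ value, source }` object, a small recursive walk recovers the plain payload for code paths that don't need provenance. This helper is a hypothetical utility, not part of the SDK:

```python
def strip_sources(node):
    """Collapse a sourced payload back to plain values by replacing every
    {value, source} leaf with its value, recursing through dicts and lists."""
    if isinstance(node, dict) and set(node) == {"value", "source"}:
        return node["value"]
    if isinstance(node, dict):
        return {k: strip_sources(v) for k, v in node.items()}
    if isinstance(node, list):
        return [strip_sources(v) for v in node]
    return node
```

A review UI would keep the sourced form for citations and pass the stripped form to downstream storage.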
Managing Saved Extractions
Extractions are persisted resources. After calling client.extractions.create(...), you can retrieve them later, list them with pagination, and filter them by metadata or date range.
Listing extractions
Use client.extractions.list(...) when you need to browse recent extractions, paginate through large histories, or filter by metadata.
- limit: page size
- order: asc or desc by creation date
- before and after: cursor pagination
- from_date and to_date: date-range filters
- metadata: exact-match metadata filtering
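A small helper can assemble these keyword arguments before the call, dropping unset filters and catching an invalid sort order early. The parameter names follow this page; the helper itself and the defaults are illustrative:

```python
def list_params(limit: int = 25, order: str = "desc", before=None, after=None,
                from_date=None, to_date=None, metadata=None) -> dict:
    """Build kwargs for client.extractions.list(...), omitting unset filters."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    params = {"limit": limit, "order": order, "before": before, "after": after,
              "from_date": from_date, "to_date": to_date, "metadata": metadata}
    return {k: v for k, v in params.items() if v is not None}

# Sketched usage (requires an API key, so it is not run here):
# page = client.extractions.list(**list_params(metadata={"tenant_id": "acme"}))
```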
Getting an extraction by ID
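The page only states that a saved extraction can be retrieved by its persisted id, so the method name below is an assumption; the id-validation helper is a hypothetical local guard, not part of the SDK:

```python
def require_extraction_id(extraction_id: str) -> str:
    """Fail fast on an empty or blank id before making the API round trip."""
    if not extraction_id or not extraction_id.strip():
        raise ValueError("extraction_id must be a non-empty string")
    return extraction_id

# Retrieval sketch -- the exact method name is an assumption:
# extraction = client.extractions.retrieve(require_extraction_id(extraction_id))
```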
Filtering by metadata
Metadata is useful when you want to organize extractions by tenant, workflow, batch, or any other business key you attached at creation time.

Use Case: Extracting Event Information from Documents
Extract structured calendar event data from a booking confirmation image and validate likelihood scores before saving to a database.

Use Case: Passing Extra Instructions
Use instructions to provide a freeform note that guides the extraction: useful for iteration context from a loop, tenant-specific conventions, or any hint that isn’t captured by the schema itself.
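A sketch of the pattern, assuming the `instructions` parameter name from this page; the hint text, schema name, and document path are hypothetical, and the call is shown for shape only:

```python
# A concise, domain-specific hint that the schema alone cannot express.
instructions = (
    "Amounts without an explicit currency symbol are EUR. "
    "Dates in the source use DD/MM/YYYY; normalize them to ISO 8601."
)

# extraction = client.extractions.create(
#     document="invoice.pdf",
#     json_schema=invoice_schema,
#     model="retab-small",
#     instructions=instructions,
# )
```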
Metadata Filters
Use metadata at extraction time to tag each document with stable identifiers, then filter those extractions later using the same keys.
This is especially important in multi-tenant systems: always include stable tenant-scoping metadata keys to avoid cross-tenant contamination.
1. Tag the extraction request with metadata
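A tagging sketch, assuming the `metadata` parameter from this page; the key names and values are illustrative business identifiers, and the call is shown for shape only:

```python
# Stable business keys attached at creation time (values are illustrative).
metadata = {
    "tenant_id": "acme-corp",
    "workflow": "invoice-intake",
    "batch_id": "2024-06-01",
}

# client.extractions.create(
#     document="invoice.pdf",
#     json_schema=invoice_schema,
#     model="retab-small",
#     metadata=metadata,
# )
```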
2. Filter extractions by metadata
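Filtering reuses the same keys. The page describes exact-match semantics, which the hypothetical local predicate below mirrors; the `list` call is sketched with the parameter name from this page:

```python
def matches_metadata(item_metadata: dict, query: dict) -> bool:
    """Exact-match semantics: every query key must be present with an
    equal value in the item's metadata."""
    return all(item_metadata.get(k) == v for k, v in query.items())

# Server-side filtering sketch (not run here):
# page = client.extractions.list(metadata={"tenant_id": "acme-corp"})
```

Always scoping queries by a tenant key like `tenant_id` is what prevents the cross-tenant contamination warned about above.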
Best Practices
Model Selection
- retab-large: Use for complex documents requiring deep contextual understanding.
- retab-small: Balanced for accuracy and cost, recommended for most extraction tasks.
- retab-micro: Faster and cheaper for simple extractions or high-volume processing.
Schema Design
- Keep schemas concise: Only include required fields to improve extraction accuracy.
- Use descriptive description fields: Helps the AI model understand what to extract.
- Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.
Confidence Handling
- Set a threshold (e.g., 0.7) for automated processing.
- For critical tasks, enable n_consensus > 1 to average results and boost reliability.
Using Instructions
- Use instructions to pass domain-specific hints that aren’t part of the schema.
- Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
- Keep the string concise to avoid diluting the extraction focus.