Extract structured data from a document against a JSON schema and persist the result as an Extraction resource that can later be retrieved via GET /v1/extractions/{extraction_id} or listed via GET /v1/extractions.
from retab import Retab

client = Retab()

extraction = client.extractions.create(
    document="Invoice.pdf",  # file path; the SDK converts it to MIMEData
    json_schema="Invoice_schema.json",
    model="retab-small",
    n_consensus=1,  # single deterministic pass
    metadata={"source": "docs"},
)

print(f"Extraction ID: {extraction.id}")
print(f"Filename: {extraction.file.filename}")
print(extraction.output)
{
  "id": "extr_01G34H8J2K",
  "organization_id": "org_abc123",
  "file": {
    "id": "file_6dd6eb00688ad8d1",
    "filename": "Invoice.pdf",
    "mime_type": "application/pdf"
  },
  "model": "retab-small",
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total": { "type": "number" }
    }
  },
  "n_consensus": 1,
  "image_resolution_dpi": 192,
  "output": {
    "invoice_number": "INV-2024-0042",
    "total": 1234.56
  },
  "consensus": {
    "choices": [],
    "likelihoods": {
      "invoice_number": 0.98,
      "total": 0.97
    }
  },
  "metadata": {
    "source": "docs"
  },
  "usage": {
    "prompt_tokens": 2760,
    "completion_tokens": 20,
    "total_tokens": 2780
  },
  "created_at": "2024-03-15T10:30:00Z",
  "updated_at": "2024-03-15T10:30:00Z"
}
Request Body
document: The document to extract from. HTTP callers must pass a MIMEData object with filename and url (a data URL or an https URL). The Python and Node SDKs also accept file paths, file-like objects, buffers, and URLs, and convert them for you.
json_schema: JSON Schema describing the structured output.
model
string
default: "retab-small"
The model used for the extraction.
image_resolution_dpi: DPI used when rasterizing pages. Accepted values are 96 to 300.
Parallel OCR chunking keys for long list fields, e.g. {"line_items": "identity.id"}.
additional_messages: Additional context for the extraction (e.g. iteration context from a workflow loop).
n_consensus: Number of consensus extraction runs to perform. Uses a single deterministic pass when set to 1. Maximum: 16.
metadata: User-defined metadata to associate with this extraction.
When true, bypass the cache and re-run the extraction.
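For HTTP callers, the MIMEData requirement described above (a filename plus a url that may be a data URL) can be met with the standard library alone. The following is a minimal sketch, not the SDK's own conversion logic: the helper name `build_mime_data` is hypothetical, and the assumption is that a base64 data URL of the usual `data:<mime>;base64,<payload>` form is acceptable.

```python
import base64
import mimetypes


def build_mime_data(filename: str, content: bytes) -> dict:
    """Wrap raw file bytes as a {"filename", "url"} object with a base64 data URL.

    Hypothetical helper: the key names come from the MIMEData description above,
    everything else is an illustrative assumption.
    """
    # Guess the MIME type from the file extension; fall back to a generic binary type.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(content).decode("ascii")
    return {"filename": filename, "url": f"data:{mime_type};base64,{encoded}"}


# Placeholder bytes stand in for the contents of a real PDF.
mime_data = build_mime_data("Invoice.pdf", b"%PDF-1.4 example bytes")
print(mime_data["filename"])
print(mime_data["url"][:28])
```

The resulting dictionary would be sent as the document value in the request body; SDK users do not need this step because the SDKs perform the conversion themselves.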
Response Fields
id: Unique extraction identifier.
file: File metadata (id, filename, mime_type).
model: Model used for the extraction.
json_schema: JSON Schema used for the extraction.
output: The extracted structured data matching json_schema.
consensus: Consensus metadata.
consensus.choices: Alternative extraction vote outputs used to build the consolidated result.
consensus.likelihoods: Consensus likelihood tree mirroring the extraction output.
n_consensus: Number of consensus votes used.
image_resolution_dpi: DPI used when rendering images.
metadata: User-defined metadata echoed from the request.
usage: Token and credit usage information.
created_at: ISO 8601 creation timestamp.
updated_at: ISO 8601 last update timestamp.
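Because the consensus likelihood tree mirrors the extraction output, it can be walked to flag fields that may need manual review. The sketch below is one possible way to do that: the function name `low_confidence_paths` and the 0.975 threshold are illustrative, and the handling of nested objects and lists is an assumption, since the sample response only shows a flat tree.

```python
def low_confidence_paths(likelihoods, threshold, prefix=""):
    """Return dotted paths of leaf fields whose likelihood is below threshold."""
    flagged = []
    if isinstance(likelihoods, dict):
        # Objects: recurse into each key, extending the dotted path.
        for key, value in likelihoods.items():
            path = f"{prefix}.{key}" if prefix else key
            flagged.extend(low_confidence_paths(value, threshold, path))
    elif isinstance(likelihoods, list):
        # Lists (assumed shape): recurse into each element with an index suffix.
        for index, value in enumerate(likelihoods):
            flagged.extend(low_confidence_paths(value, threshold, f"{prefix}[{index}]"))
    elif isinstance(likelihoods, (int, float)) and likelihoods < threshold:
        # Leaf: a numeric likelihood for one extracted field.
        flagged.append(prefix)
    return flagged


# The consensus object from the sample response.
consensus = {
    "choices": [],
    "likelihoods": {"invoice_number": 0.98, "total": 0.97},
}
needs_review = low_confidence_paths(consensus["likelihoods"], threshold=0.975)
print(needs_review)  # ['total']
```

A suitable threshold depends on the document type and on how costly an undetected error is, so the value used here should not be read as a recommendation.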
document
MIMEData · object
required
The document to extract from
json_schema
Json Schema · object
required
JSON schema describing the structured output
model
string
default: retab-small
The model to use for the extraction
Resolution of the image sent to the LLM
Required range: 96 <= x <= 300
Parallel OCR chunking keys for long list fields
Example: { "products": "identity.id", "properties": "ID" }
Number of consensus extraction runs to perform. Uses a deterministic single pass when set to 1.
Required range: 1 <= x <= 16
User-defined metadata to associate with this extraction
If true, skip the LLM cache and force a fresh completion
additional_messages
Additional Messages · object[] | null
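The two numeric ranges listed for the request (1 to 16 consensus runs, 96 to 300 DPI) can be checked client-side before a request is sent. This is a small sketch under stated assumptions: `check_extraction_options` is a hypothetical helper, and `image_resolution_dpi` is assumed to be the request parameter name because that is the name the sample response uses.

```python
from typing import Optional


def check_extraction_options(n_consensus: int,
                             image_resolution_dpi: Optional[int] = None) -> dict:
    """Validate the documented ranges and return the options to send.

    Hypothetical helper; the API performs its own validation regardless.
    """
    if not 1 <= n_consensus <= 16:
        raise ValueError(f"n_consensus must be between 1 and 16, got {n_consensus}")
    options = {"n_consensus": n_consensus}
    # Only include the DPI when the caller sets it, so the server default applies otherwise.
    if image_resolution_dpi is not None:
        if not 96 <= image_resolution_dpi <= 300:
            raise ValueError(
                f"image_resolution_dpi must be between 96 and 300, got {image_resolution_dpi}"
            )
        options["image_resolution_dpi"] = image_resolution_dpi
    return options


options = check_extraction_options(n_consensus=4, image_resolution_dpi=192)
print(options)  # {'n_consensus': 4, 'image_resolution_dpi': 192}
```

Failing early with a `ValueError` avoids a round trip for requests that would be rejected on range grounds.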
Backend-internal extraction record with organization scoping.
file
ExtractionFile · object
required
Backend-internal file reference (adds gcs_path)
Model used for the extraction
json_schema
Json Schema · object
required
JSON schema used for the extraction
The extracted structured data
Organization ID of the user or application
Unique identifier of the extraction
Number of consensus votes used
DPI used to render document images
Parallel OCR chunking keys for long list fields
Additional context supplied with the extraction request
consensus
ExtractionConsensus · object
Consensus metadata for multi-vote extraction runs
origin
ProcessingRequestOrigin · object
Origin of the extraction request
Usage information for the extraction