Extract structured data from a document against a JSON schema and persist the result as an Extraction resource that can later be retrieved via GET /v1/extractions/{extraction_id} or listed via GET /v1/extractions.
from retab import Retab

client = Retab()

extraction = client.extractions.create(
    document="Invoice.pdf",
    json_schema="Invoice_schema.json",
    model="retab-small",
    n_consensus=1,
    metadata={"source": "docs"},
)

print(f"Extraction ID: {extraction.id}")
print(f"Filename: {extraction.file.filename}")
print(extraction.output)
{
  "id": "extr_01G34H8J2K",
  "file": {
    "id": "file_6dd6eb00688ad8d1",
    "filename": "Invoice.pdf",
    "mime_type": "application/pdf"
  },
  "model": "retab-small",
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total": { "type": "number" }
    }
  },
  "n_consensus": 1,
  "image_resolution_dpi": 192,
  "output": {
    "invoice_number": "INV-2024-0042",
    "total": 1234.56
  },
  "consensus": {
    "choices": [],
    "likelihoods": {
      "invoice_number": 0.98,
      "total": 0.97
    }
  },
  "metadata": {
    "source": "docs"
  },
  "usage": {
    "prompt_tokens": 2760,
    "completion_tokens": 20,
    "total_tokens": 2780
  },
  "created_at": "2024-03-15T10:30:00Z",
  "updated_at": "2024-03-15T10:30:00Z"
}
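The id in a response like the one above is the path parameter for the retrieval endpoint. A minimal sketch of building the two endpoint URLs named in the introduction; the helper name extractions_url is hypothetical, the base URL is a placeholder to be replaced with the real API host, and authentication is not shown:

```python
from typing import Optional
from urllib.parse import quote


def extractions_url(base_url: str, extraction_id: Optional[str] = None) -> str:
    """Return the list URL, or the single-resource URL when an id is given.

    Hypothetical helper: only the /v1/extractions paths come from this page.
    """
    url = base_url.rstrip("/") + "/v1/extractions"
    if extraction_id is not None:
        # Percent-encode the id so it is always a single, safe path segment.
        url += "/" + quote(extraction_id, safe="")
    return url


# Placeholder host, not the real API base URL.
print(extractions_url("https://api.example.com", "extr_01G34H8J2K"))
print(extractions_url("https://api.example.com"))
```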
Request Body
document
The document to extract from. HTTP callers must pass a MIMEData object with filename and url (a data URL or an https URL). The Python and Node SDKs also accept file paths, file-like objects, buffers, and URLs, and convert them for you.

json_schema
JSON Schema describing the structured output.

model
string
default: "retab-small"
The model used for the extraction.

image_resolution_dpi
DPI used when rasterizing pages. Accepted values are 96 to 300.

Free-form instructions appended to the system prompt to steer the extraction.

n_consensus
Number of consensus extraction runs to perform. Runs a single deterministic pass when set to 1. Maximum: 16.

metadata
User-defined metadata to associate with this extraction.

When true, bypass the cache and re-run the extraction.
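For HTTP callers, the document requirement above means building the MIMEData object by hand. A minimal sketch, assuming MIMEData serializes as a JSON object with the two keys named above (filename and url) and that a base64 data URL is acceptable; the helper name to_mime_data is hypothetical and not part of any SDK:

```python
import base64
import mimetypes


def to_mime_data(filename: str, content: bytes) -> dict:
    """Build a MIMEData-style dict (filename + data URL) from raw file bytes.

    Hypothetical helper: the key names come from the parameter description;
    the base64 data-URL encoding is an assumption.
    """
    # Guess the MIME type from the extension; fall back to a generic binary type.
    mime_type, _ = mimetypes.guess_type(filename)
    mime_type = mime_type or "application/octet-stream"
    encoded = base64.b64encode(content).decode("ascii")
    return {
        "filename": filename,
        "url": f"data:{mime_type};base64,{encoded}",
    }


# Stand-in bytes; in practice these would be read from the file on disk.
document = to_mime_data("Invoice.pdf", b"%PDF-1.4 example bytes")
print(document["filename"], document["url"][:28])
```

The SDKs perform an equivalent conversion for you, so this is only relevant when calling the endpoint directly.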
Response Fields
id
Unique extraction identifier.

file
File metadata: id, filename, mime_type.

model
Model used for the extraction.

json_schema
JSON Schema used for the extraction.

output
The extracted structured data matching json_schema.

consensus
Consensus metadata.

consensus.choices
Alternative extraction vote outputs used to build the consolidated result.

consensus.likelihoods
Consensus likelihood tree mirroring the extraction output.

n_consensus
Number of consensus votes used.

image_resolution_dpi
DPI used when rendering images.

metadata
User-defined metadata echoed from the request.

usage
Token and credit usage information.

created_at
ISO 8601 creation timestamp.

updated_at
ISO 8601 last update timestamp.
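Because the likelihood tree mirrors the output, the two can be walked together to flag fields that may deserve a manual check. A minimal sketch using the sample response from this page; the function name low_confidence_fields and the 0.975 threshold are illustrative assumptions, and nested objects are assumed to mirror as nested dicts with numeric leaves:

```python
from typing import Any, Dict, List, Tuple


def low_confidence_fields(
    output: Dict[str, Any],
    likelihoods: Dict[str, Any],
    threshold: float,
    prefix: str = "",
) -> List[Tuple[str, Any, float]]:
    """Return (path, value, likelihood) for leaves scoring below threshold."""
    flagged: List[Tuple[str, Any, float]] = []
    for key, value in output.items():
        path = f"{prefix}{key}"
        score = likelihoods.get(key)
        if isinstance(value, dict) and isinstance(score, dict):
            # Nested object: recurse with a dotted path prefix.
            flagged.extend(
                low_confidence_fields(value, score, threshold, prefix=f"{path}.")
            )
        elif isinstance(score, (int, float)) and score < threshold:
            flagged.append((path, value, float(score)))
    return flagged


# Trimmed copy of the sample response shown on this page.
response = {
    "output": {"invoice_number": "INV-2024-0042", "total": 1234.56},
    "consensus": {
        "choices": [],
        "likelihoods": {"invoice_number": 0.98, "total": 0.97},
    },
}

flagged = low_confidence_fields(
    response["output"], response["consensus"]["likelihoods"], threshold=0.975
)
print(flagged)  # [('total', 1234.56, 0.97)]
```

Fields with no matching likelihood entry are skipped rather than flagged; a stricter caller might treat a missing score as low confidence.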
document
MIMEData · object
required
The document to extract from.

json_schema
Json Schema · object
required
JSON Schema describing the structured output.

model
string
default: retab-small
The model to use for the extraction.

image_resolution_dpi
Resolution (DPI) of the image sent to the LLM.
Required range: 96 <= x <= 300

Free-form instructions appended to the system prompt to steer the extraction.

n_consensus
Number of consensus extraction runs to perform. Runs a single deterministic pass when set to 1.
Required range: 1 <= x <= 16

metadata
User-defined metadata to associate with this extraction.

additional_messages
Additional Messages · object[] | null

If true, skip the LLM cache and force a fresh completion.
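The two "Required range" constraints above can be checked client-side before a request is sent. A minimal sketch; check_ranges is a hypothetical helper, and the request parameter name image_resolution_dpi is assumed to match the response field of the same name:

```python
from typing import List, Optional

# Inclusive bounds taken from the "Required range" lines above.
DPI_RANGE = (96, 300)
N_CONSENSUS_RANGE = (1, 16)


def check_ranges(
    n_consensus: Optional[int] = None,
    image_resolution_dpi: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means every given value is in range."""
    problems: List[str] = []
    if n_consensus is not None:
        low, high = N_CONSENSUS_RANGE
        if not low <= n_consensus <= high:
            problems.append(f"n_consensus must be in [{low}, {high}], got {n_consensus}")
    if image_resolution_dpi is not None:
        low, high = DPI_RANGE
        if not low <= image_resolution_dpi <= high:
            problems.append(
                f"image_resolution_dpi must be in [{low}, {high}], got {image_resolution_dpi}"
            )
    return problems


print(check_ranges(n_consensus=1, image_resolution_dpi=192))
print(check_ranges(n_consensus=17, image_resolution_dpi=72))
```

Omitted values are not checked, so the server-side defaults still apply to anything left as None.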
Backend-internal extraction record with organization scoping.

file
ExtractionFile · object
required
Backend-internal file reference (adds gcs_path).

model
Model used for the extraction.

json_schema
Json Schema · object
required
JSON Schema used for the extraction.

output
The extracted structured data.

id
Unique identifier of the extraction.

n_consensus
Number of consensus votes used.

image_resolution_dpi
DPI used to render document images.

Free-form instructions supplied with the extraction request.

consensus
ExtractionConsensus · object
Consensus metadata for multi-vote extraction runs.

origin
ProcessingRequestOrigin · object
Origin of the extraction request.

usage
Usage information for the extraction.