POST /v1/extractions
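from retab import Retab

# Create the API client.
client = Retab()

# Run the extraction. The SDK accepts a local file path for the document.
extraction = client.extractions.create(
    document="Invoice.pdf",
    json_schema="Invoice_schema.json",
    model="retab-small",
    n_consensus=1,  # single deterministic pass
    metadata={"source": "docs"},  # echoed back in the response
)

# The returned Extraction resource carries the ID, file metadata, and output.
print(f"Extraction ID: {extraction.id}")
print(f"Filename: {extraction.file.filename}")
print(extraction.output)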
{
  "id": "extr_01G34H8J2K",
  "organization_id": "org_abc123",
  "file": {
    "id": "file_6dd6eb00688ad8d1",
    "filename": "Invoice.pdf",
    "mime_type": "application/pdf"
  },
  "model": "retab-small",
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": {"type": "string"},
      "total": {"type": "number"}
    }
  },
  "n_consensus": 1,
  "image_resolution_dpi": 192,
  "output": {
    "invoice_number": "INV-2024-0042",
    "total": 1234.56
  },
  "consensus": {
    "choices": [],
    "likelihoods": {
      "invoice_number": 0.98,
      "total": 0.97
    }
  },
  "metadata": {
    "source": "docs"
  },
  "usage": {
    "prompt_tokens": 2760,
    "completion_tokens": 20,
    "total_tokens": 2780
  },
  "created_at": "2024-03-15T10:30:00Z",
  "updated_at": "2024-03-15T10:30:00Z"
}
Extracts structured data from a document against a JSON Schema and persists the result as an Extraction resource. The resource can later be retrieved via GET /v1/extractions/{extraction_id} or listed via GET /v1/extractions.
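For HTTP callers, the two read endpoints named above can be addressed with a small URL helper. This is a minimal sketch: BASE_URL is a placeholder and the helper names are hypothetical, since this page does not state the API host.

```python
from urllib.parse import quote

# Assumption: placeholder host, not taken from this page.
BASE_URL = "https://api.example.com"


def extraction_url(extraction_id: str) -> str:
    """URL for GET /v1/extractions/{extraction_id}."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{BASE_URL}/v1/extractions/{quote(extraction_id, safe='')}"


def list_extractions_url() -> str:
    """URL for GET /v1/extractions."""
    return f"{BASE_URL}/v1/extractions"


print(extraction_url("extr_01G34H8J2K"))
print(list_extractions_url())
```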

Request Body

document
MIMEData
required
The document to extract from. HTTP callers must pass a MIMEData object with filename and url (data URL or https URL). The Python and Node SDKs also accept file paths, file-like objects, buffers, and URLs and convert them for you.
json_schema
object
required
JSON Schema describing the structured output.
model
string
default:"retab-small"
The model used for the extraction.
image_resolution_dpi
integer
default:192
DPI used when rasterizing pages. Accepted values are 96 to 300.
chunking_keys
object
Parallel OCR chunking keys for long list fields, e.g. {"line_items": "identity.id"}.
context
string
Additional context for the extraction (e.g. iteration context from a workflow loop).
n_consensus
integer
default:1
Number of consensus extraction runs to perform. A value of 1 runs a single deterministic pass. Maximum: 16.
metadata
object
User-defined metadata to associate with this extraction.
bust_cache
boolean
default:false
When true, bypass the cache and re-run the extraction.
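For callers not using an SDK, the request body can be assembled by hand. The sketch below builds a MIMEData-style object (filename plus a base64 data URL) from raw bytes and combines it with the other fields listed above. The helper names to_mime_data and build_request_body are hypothetical, and the PDF bytes are a stand-in for a real file.

```python
import base64
import mimetypes


def to_mime_data(filename: str, content: bytes) -> dict:
    """Build a MIMEData-style dict: filename plus a base64 data URL."""
    # Fall back to a generic type when the extension is unknown.
    mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(content).decode("ascii")
    return {"filename": filename, "url": f"data:{mime_type};base64,{encoded}"}


def build_request_body(document: dict, json_schema: dict, **options) -> dict:
    """Combine the two required fields with any optional ones that are set."""
    body = {"document": document, "json_schema": json_schema}
    body.update({key: value for key, value in options.items() if value is not None})
    return body


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Stand-in bytes; a real caller would read the file from disk.
document = to_mime_data("Invoice.pdf", b"%PDF-1.4 placeholder")
body = build_request_body(
    document,
    schema,
    model="retab-small",
    n_consensus=1,
    metadata={"source": "docs"},
    context=None,  # unset options are left out of the body
)
print(sorted(body))
```

Options left as None are omitted so that the server-side defaults documented above apply.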

Response Fields

id
string
Unique extraction identifier.
file
object
File metadata: id, filename, mime_type.
model
string
Model used for the extraction.
json_schema
object
JSON Schema used for the extraction.
output
object
The extracted structured data matching json_schema.
consensus
object
Consensus metadata.
n_consensus
integer
Number of consensus votes used.
image_resolution_dpi
integer
DPI used when rendering images.
metadata
object
User-defined metadata echoed from the request.
usage
RetabUsage | null
Token and credit usage information.
created_at
string
ISO 8601 creation timestamp.
updated_at
string
ISO 8601 last update timestamp.
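One way to use the consensus metadata is to flag fields whose likelihood falls below a cutoff. This is a sketch under assumptions: it reads the flat consensus.likelihoods mapping shown in the example response, the 0.9 default threshold is arbitrary, and low_confidence_fields is a hypothetical helper. Nested likelihood structures are not handled.

```python
def low_confidence_fields(extraction: dict, threshold: float = 0.9) -> list[str]:
    """Return names of top-level fields whose likelihood is below threshold."""
    # consensus or likelihoods may be missing, so default to an empty mapping.
    likelihoods = (extraction.get("consensus") or {}).get("likelihoods") or {}
    return sorted(
        name
        for name, score in likelihoods.items()
        if isinstance(score, (int, float)) and score < threshold
    )


# Trimmed copy of the example response on this page.
response = {
    "output": {"invoice_number": "INV-2024-0042", "total": 1234.56},
    "consensus": {
        "choices": [],
        "likelihoods": {"invoice_number": 0.98, "total": 0.97},
    },
}

print(low_confidence_fields(response))
print(low_confidence_fields(response, threshold=0.975))
```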

Authorizations

Api-Key
string
header
required
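A raw HTTP call carries the key in the Api-Key header. The sketch below only constructs the request with the standard library and does not send it. BASE_URL and the key value are placeholders, and build_create_request is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: placeholder host, not taken from this page.
BASE_URL = "https://api.example.com"


def build_create_request(body: dict, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST /v1/extractions request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/extractions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "document": {"filename": "Invoice.pdf", "url": "https://example.com/Invoice.pdf"},
    "json_schema": {"type": "object", "properties": {"total": {"type": "number"}}},
}
request = build_create_request(payload, api_key="placeholder-key")

# Sending would be urllib.request.urlopen(request); omitted here on purpose.
print(request.get_method(), request.full_url)
```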

Query Parameters

access_token
string | null

Body

application/json
document
MIMEData · object
required

The document to extract from

json_schema
Json Schema · object
required

JSON schema describing the structured output

model
string
default:retab-small

The model to use for the extraction

image_resolution_dpi
integer
default:192

Resolution of the image sent to the LLM

Required range: 96 <= x <= 300
chunking_keys
Chunking Keys · object

Parallel OCR chunking keys for long list fields

Example:
{
"products": "identity.id",
"properties": "ID"
}
context
unknown

Additional context for the extraction (e.g. iteration context from a workflow loop)
n_consensus
integer
default:1

Number of consensus extraction runs to perform. A value of 1 runs a single deterministic pass.

Required range: 1 <= x <= 16
metadata
Metadata · object

User-defined metadata to associate with this extraction

bust_cache
boolean
default:false

If true, skip the LLM cache and force a fresh completion

stream
boolean
default:false
additional_messages
Additional Messages · object[] | null
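The two documented ranges (96 <= image_resolution_dpi <= 300 and 1 <= n_consensus <= 16) can be checked client-side before a request is sent. This is an illustrative sketch; validate_options is a hypothetical helper, and the server remains the authority on what it accepts.

```python
def validate_options(image_resolution_dpi: int = 192, n_consensus: int = 1) -> dict:
    """Check the documented ranges and return the options as a dict."""
    if not 96 <= image_resolution_dpi <= 300:
        raise ValueError(
            f"image_resolution_dpi must be in [96, 300], got {image_resolution_dpi}"
        )
    if not 1 <= n_consensus <= 16:
        raise ValueError(f"n_consensus must be in [1, 16], got {n_consensus}")
    return {
        "image_resolution_dpi": image_resolution_dpi,
        "n_consensus": n_consensus,
    }


# Defaults mirror the ones documented above.
print(validate_options())
print(validate_options(image_resolution_dpi=300, n_consensus=5))
```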

Response

Successful Response

The persisted extraction record, scoped to the caller's organization.

file
ExtractionFile · object
required

File reference for the source document (the backend model adds gcs_path)

model
string
required

Model used for the extraction

json_schema
Json Schema · object
required

JSON schema used for the extraction

output
Output · object
required

The extracted structured data

organization_id
string
required

Organization ID of the user or application

id
string

Unique identifier of the extraction

n_consensus
integer
default:1

Number of consensus votes used

image_resolution_dpi
integer
default:192

DPI used to render document images

chunking_keys
Chunking Keys · object

Parallel OCR chunking keys for long list fields

context
string | null

Additional context supplied with the extraction request

consensus
ExtractionConsensus · object

Consensus metadata for multi-vote extraction runs

origin
ProcessingRequestOrigin · object

Origin of the extraction request

metadata
Metadata · object
usage
RetabUsage · object

Usage information for the extraction

created_at
string<date-time>
updated_at
string<date-time>
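The timestamp and usage fields can be read from the JSON response with the standard library. The sketch below assumes the timestamps use the trailing-Z UTC form shown in the example response and that usage may be null. parse_timestamp and total_tokens are hypothetical helpers.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-03-15T10:30:00Z."""
    # Before Python 3.11, fromisoformat rejects a trailing "Z", so swap it
    # for an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def total_tokens(record: dict) -> int:
    """Sum prompt and completion tokens, treating a null usage as zero."""
    usage = record.get("usage") or {}
    return usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)


# Trimmed copy of the example response on this page.
record = {
    "usage": {"prompt_tokens": 2760, "completion_tokens": 20, "total_tokens": 2780},
    "created_at": "2024-03-15T10:30:00Z",
    "updated_at": "2024-03-15T10:30:00Z",
}

created = parse_timestamp(record["created_at"])
print(created.isoformat())
print(total_tokens(record))
```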