Create Extraction - Retab Docs

from retab import Retab

# Create the API client.
client = Retab()

# Run a single-pass extraction (n_consensus=1) on a document against a JSON schema.
extraction = client.extractions.create(
    document="Invoice.pdf",
    json_schema="Invoice_schema.json",
    model="retab-small",
    n_consensus=1,
    metadata={"source": "docs"},
)

# Inspect the stored extraction record.
print(f"Extraction ID: {extraction.id}")
print(f"Filename: {extraction.file.filename}")
print(extraction.output)

{
  "id": "extr_01G34H8J2K",
  "file": {
    "id": "file_6dd6eb00688ad8d1",
    "filename": "Invoice.pdf",
    "mime_type": "application/pdf"
  },
  "model": "retab-small",
  "json_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "total": { "type": "number" }
    }
  },
  "n_consensus": 1,
  "image_resolution_dpi": 192,
  "output": {
    "invoice_number": "INV-2024-0042",
    "total": 1234.56
  },
  "consensus": {
    "choices": [],
    "likelihoods": {
      "invoice_number": 0.98,
      "total": 0.97
    }
  },
  "metadata": {
    "source": "docs"
  },
  "usage": {
    "prompt_tokens": 2760,
    "completion_tokens": 20,
    "total_tokens": 2780
  },
  "created_at": "2024-03-15T10:30:00Z"
}

POST

extractions

Extract structured data from a document against a JSON schema and persist the result as an Extraction resource that can later be retrieved via GET /v1/extractions/{extraction_id} or listed via GET /v1/extractions.

Authorizations

Api-Key

string

header

required
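As a rough illustration of the header-based authorization above, the sketch below builds (but does not send) a raw HTTP request that carries the key in an `Api-Key` header. The base URL and the POST path are assumptions, not stated on this page, and the body is abbreviated.

```python
import json
import urllib.request

# Assumed values, not stated on this page: adjust to your deployment.
BASE_URL = "https://api.retab.com"
CREATE_PATH = "/v1/extractions"


def build_create_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the Api-Key header."""
    return urllib.request.Request(
        url=BASE_URL + CREATE_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Abbreviated body: a real request also needs document and json_schema.
request = build_create_request("YOUR_API_KEY", {"model": "retab-small"})
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the SDK client shown at the top of the page is likely the more convenient route.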

Body

application/json

Request to run a structured extraction on a single document.

Extends the base extraction request with the document to process (either inline content or a reference to a previously uploaded file) and a stream flag that controls whether results are returned incrementally.

document

MIMEData · object

required

A file represented by its filename and a base64 data url.

Accepted variants: MIMEData, FileRef
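For the MIMEData variant, a minimal sketch of packaging raw bytes as a filename plus a base64 data URL might look like the following. The key names `filename` and `url` are assumptions based on the description above; the child attributes of the real schema may differ.

```python
import base64
import mimetypes


def to_mime_data(filename: str, content: bytes) -> dict:
    """Package raw bytes as a filename plus a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(filename)
    mime_type = mime_type or "application/octet-stream"
    encoded = base64.b64encode(content).decode("ascii")
    # The key names "filename" and "url" are assumptions, not confirmed by this page.
    return {"filename": filename, "url": f"data:{mime_type};base64,{encoded}"}


# Placeholder bytes stand in for the contents of a real PDF.
document = to_mime_data("Invoice.pdf", b"%PDF-1.7 example bytes")
print(document["url"][:40])
```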

json_schema

Json Schema · object

required

JSON schema describing the structured output
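As an illustration, the schema echoed in the sample response on this page can be written to a file such as the `Invoice_schema.json` referenced in the SDK example. This is only a sketch: the temporary directory is a placeholder choice, and a real schema would usually describe more fields.

```python
import json
import tempfile
from pathlib import Path

# Same shape as the json_schema shown in the sample response on this page.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Write to a temporary directory so the sketch has no side effects elsewhere.
schema_path = Path(tempfile.mkdtemp()) / "Invoice_schema.json"
schema_path.write_text(json.dumps(invoice_schema, indent=2), encoding="utf-8")
print(schema_path.name)
```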

model

string

default:retab-small

The model to use for the extraction

image_resolution_dpi

integer

default:192

Resolution, in DPI, of the document images sent to the LLM

instructions

string | null

Free-form instructions appended to the system prompt to steer the extraction.

n_consensus

integer

default:1

Number of consensus extraction runs to perform. When set to 1, a single deterministic pass is used.

metadata

Metadata · object

User-defined metadata to associate with this extraction

additional_messages

Additional Messages · object[] | null

Additional chat messages forwarded to the extraction model.

bust_cache

boolean

default:false

If true, skip the LLM cache and force a fresh completion

stream

boolean

default:false

If true, return results incrementally. Mutually exclusive with background.

background

boolean

default:false

If true, run asynchronously: the call returns immediately with status 'queued' and an empty output. Poll GET /v1/extractions/{extraction_id} until the status is terminal. Mutually exclusive with stream.
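The polling pattern for background runs could be sketched as below. The `fetch` callable is a hypothetical stand-in for whatever retrieves the extraction record (for example an SDK or HTTP call), the interval and timeout values are arbitrary, and the terminal statuses are taken from the status field documented under Response.

```python
import time
from typing import Callable

# Terminal states per the documented lifecycle: completed | failed | cancelled.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_extraction(
    fetch: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the returned record reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        record = fetch()
        if record.get("status") in TERMINAL_STATUSES:
            return record
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {record.get('status')!r} after {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real retrieval call: replays a fixed status sequence.
_statuses = iter(["queued", "in_progress", "completed"])
result = wait_for_extraction(lambda: {"status": next(_statuses)}, sleep=lambda _: None)
print(result["status"])
```

Injecting `sleep` keeps the helper easy to exercise without real delays; in production code the default `time.sleep` applies.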

chunking_keys

Chunking Keys · object

Response

Successful Response

A stored extraction record from the Retab API.

id

string

required

Unique identifier of the extraction

file

FileRef · object

required

Information about the extracted file

model

string

required

Model used for the extraction

json_schema

Json Schema · object

required

JSON schema used for the extraction

output

Output · object

required

The extracted structured data

n_consensus

integer

default:1

Number of consensus votes used

image_resolution_dpi

integer

default:192

DPI used to render document images

instructions

string | null

Free-form instructions supplied with the extraction request.

status

enum<string>

default:pending

Lifecycle status. The synchronous path returns 'completed'. Background runs progress pending -> queued -> in_progress -> completed | failed | cancelled.

Available options:

pending,

queued,

in_progress,

completed,

failed,

cancelled

error

PrimitiveError · object

Error details when a background run fails; null otherwise. Always present so consumers can read it without an existence check.
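One way to consume the status and error fields together is sketched below, treating the record as a plain dictionary. The exception class and helper name are hypothetical, and the error payload is treated as opaque because its child attributes are not shown here.

```python
class ExtractionFailed(RuntimeError):
    """Raised when an extraction record did not complete successfully."""


def unwrap_output(record: dict) -> dict:
    """Return the extracted data, or raise with whatever error details exist."""
    status = record.get("status")
    if status == "completed":
        return record["output"]
    if status in ("failed", "cancelled"):
        # error is documented as always present and null unless the run failed.
        raise ExtractionFailed(f"extraction {status}: {record.get('error')}")
    raise ExtractionFailed(f"extraction not finished (status={status!r})")


done = {"status": "completed", "output": {"total": 1234.56}, "error": None}
print(unwrap_output(done))
```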

consensus

ExtractionConsensus · object

Consensus metadata for multi-vote extraction runs
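A small sketch of how the consensus likelihoods from the sample response might be used to flag fields for review. It assumes `likelihoods` is a flat mapping of field name to score, as in the sample; nested schemas may produce a nested structure that this helper does not walk. The threshold is arbitrary.

```python
def low_confidence_fields(record: dict, threshold: float = 0.9) -> list[str]:
    """List field names whose likelihood falls below the threshold."""
    likelihoods = (record.get("consensus") or {}).get("likelihoods") or {}
    return sorted(
        name
        for name, score in likelihoods.items()
        if isinstance(score, (int, float)) and score < threshold
    )


# Likelihood values copied from the sample response on this page.
sample = {"consensus": {"choices": [], "likelihoods": {"invoice_number": 0.98, "total": 0.97}}}
print(low_confidence_fields(sample, threshold=0.975))
```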

metadata

Metadata · object

usage

RetabUsage · object

Usage information for the extraction
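To track consumption across several extractions, the usage objects can be summed key by key. The sketch below assumes the token fields shown in the sample response (`prompt_tokens`, `completion_tokens`, `total_tokens`) and skips records without usage data; the helper name is hypothetical.

```python
def total_usage(records: list[dict]) -> dict[str, int]:
    """Sum integer usage counters across extraction records."""
    totals: dict[str, int] = {}
    for record in records:
        for key, value in (record.get("usage") or {}).items():
            if isinstance(value, int):  # skip null or non-numeric entries
                totals[key] = totals.get(key, 0) + value
    return totals


# Usage values copied from the sample response on this page.
sample_usage = {"prompt_tokens": 2760, "completion_tokens": 20, "total_tokens": 2780}
totals = total_usage([{"usage": sample_usage}, {"usage": sample_usage}, {"usage": None}])
print(totals)
```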

created_at

string<date-time> | null
