Splitting - Retab Docs

Introduction

splits.create assigns pages in a multi-page document to named subdocuments. Each result contains only:

name
pages

That is the full mental model for split. It is a document-labeling primitive, not a key-based grouping primitive. Use split when one file contains different document types, sections, or repeated subdocument instances and you want to know which pages belong to each one. Common use cases include:

Mixed document batches: Separate invoices, receipts, contracts, and cover letters from one uploaded PDF.
Report section detection: Find the executive summary, appendix, or financial section inside a long report.
Repeated instances: Detect repeated occurrences of the same subdocument type with allow_multiple_instances=True.
Workflow routing: Route each detected subdocument type into its own downstream extraction or review branch.

Key features of the Split API:

Named subdocuments: Define the labels you care about with natural-language descriptions.
Page-level output: Results are explicit 1-indexed page arrays.
Repeated instances: The same name can appear multiple times in output.
Consensus support: Set n_consensus above 1 to get consensus.likelihoods and consensus.choices.
Pure split primitive: Key-based grouping is handled by partitions.create, not by split.
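Because each result is only a name and a 1-indexed page list, page coverage is easy to check on your side. The following is a minimal sketch, assuming the results have been copied into plain dicts (for example with `[{"name": s.name, "pages": s.pages} for s in result.output]`) and that you know the document's page count. The helper `page_coverage` is hypothetical and not part of the SDK, and this page does not state whether a page may appear under more than one label, so the sketch simply reports both cases.

```python
from collections import Counter


def page_coverage(results, total_pages):
    """Return (unassigned, shared) page lists for a set of split results.

    `results` is a list of {"name": str, "pages": [int, ...]} dicts with
    1-indexed pages, mirroring the name/pages shape described above.
    """
    # Count how many results mention each page.
    counts = Counter(page for r in results for page in r["pages"])
    unassigned = [p for p in range(1, total_pages + 1) if counts[p] == 0]
    shared = sorted(p for p, n in counts.items() if n > 1)
    return unassigned, shared


# Illustrative data, not real API output.
example = [
    {"name": "invoice", "pages": [1, 2]},
    {"name": "receipt", "pages": [2, 4]},
]
unassigned, shared = page_coverage(example, total_pages=5)
print("unassigned:", unassigned, "shared:", shared)
```

Unassigned pages are often a sign that a label is missing from the subdocument list.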

Split API

SplitRequest

document

MIMEData

required

The document to split. The HTTP API accepts MIMEData. The SDKs also accept convenient local inputs such as file paths, file-like objects, images, buffers, and URLs, then convert them for you.

model

LLMModel

required

The model to use for document splitting. Recommended default: retab-small.

subdocuments

array[Subdocument]

required

List of subdocuments to classify the document into. Each subdocument has:

name: Unique identifier for the subdocument.
description: Detailed description to help the model identify this subdocument.
partition_key (optional): Key used to partition repeated instances inside this subdocument.
allow_overlap (optional, default true): Set to false when partition chunks for this subdocument must be exclusive.
allow_multiple_instances (optional): Set to true when this subdocument type can appear more than once in the document and you want each distinct instance detected separately.
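Since names must be unique and every subdocument carries a description, a small local check can catch mistakes before a request is sent. This is a sketch only: `validate_subdocuments` is a hypothetical helper, and the API may apply its own, different validation.

```python
def validate_subdocuments(subdocuments):
    """Raise ValueError if a name is missing or duplicated, or a description is empty."""
    seen = set()
    for sub in subdocuments:
        name = sub.get("name", "").strip()
        description = sub.get("description", "").strip()
        if not name:
            raise ValueError("every subdocument needs a non-empty name")
        if name in seen:
            raise ValueError(f"duplicate subdocument name: {name!r}")
        if not description:
            raise ValueError(f"subdocument {name!r} has no description")
        seen.add(name)
    # Return the input unchanged so the call can be inlined in a request.
    return subdocuments


labels = [
    {"name": "invoice", "description": "Invoices with line items and totals"},
    {"name": "receipt", "description": "Payment receipts", "allow_multiple_instances": True},
]
validate_subdocuments(labels)
```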

instructions

string

Free-form instructions appended to the system prompt to steer the split.

n_consensus

integer

Number of split passes to run before building the final answer. Leave it at 1 for the fastest deterministic pass, or raise it when boundary quality is business-critical and you want consensus.likelihoods and consensus.choices.

Returns

Split

A persisted split resource containing the page assignments.

string

Unique identifier of the split record.

output

array[SplitResult]

List of split results, each containing:

name: The subdocument label.
pages: List of 1-indexed page numbers assigned to that split.

consensus

SplitConsensus | null

Present when n_consensus > 1. Contains:

likelihoods: A tree aligned with output, with a confidence for name and for each page leaf.
choices: One entry per consensus run.
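One way to use the likelihoods is to flag pages whose confidence falls below a threshold for manual review. The exact layout of `consensus.likelihoods` is not spelled out here, so this sketch assumes a list aligned with `output` in which each entry looks like `{"name": float, "pages": [float, ...]}`. Both that layout and the helper `low_confidence_pages` are assumptions for illustration and should be checked against a real response.

```python
def low_confidence_pages(output, likelihoods, threshold=0.8):
    """Return (name, page, score) for every page scored below `threshold`.

    ASSUMPTION: `likelihoods[i]["pages"][j]` is the confidence for
    `output[i]["pages"][j]`. The real response layout may differ.
    """
    flagged = []
    for result, scores in zip(output, likelihoods):
        for page, score in zip(result["pages"], scores["pages"]):
            if score < threshold:
                flagged.append((result["name"], page, score))
    return flagged


# Illustrative data, not real API output.
sample_output = [{"name": "invoice", "pages": [1, 2, 3]}]
sample_likelihoods = [{"name": 0.97, "pages": [0.99, 0.55, 0.9]}]
print(low_confidence_pages(sample_output, sample_likelihoods))
```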

Recommended Workflow

Define the subdocument labels you want to detect, then pass them directly into splits.create.

from retab import Retab

client = Retab()

result = client.splits.create(
    document="property_portfolio.pdf",
    model="retab-small",
    subdocuments=[
        {
            "name": "property_listing",
            "description": "Property listing pages with photos, pricing, and listing details",
            "allow_multiple_instances": True,
        },
        {
            "name": "legal_notice",
            "description": "Legal notices, disclaimers, or policy pages",
        },
    ],
    n_consensus=3,
)

for split in result.output:
    print(split.name, split.pages)

print(result.consensus.likelihoods if result.consensus else None)

Use Case: Processing Mixed Document Batches

Split a batch of scanned documents into invoices, receipts, and contracts for separate downstream handling.

from retab import Retab

client = Retab()

subdocuments = [
    {"name": "invoice", "description": "Invoice documents with billing details, line items, totals, and payment terms"},
    {"name": "receipt", "description": "Payment receipts showing transaction confirmation and amounts paid"},
    {"name": "contract", "description": "Legal contracts with terms, conditions, and signature blocks"},
    {"name": "cover_letter", "description": "Cover letters or transmittal documents"},
]

result = client.splits.create(
    document="scanned_batch.pdf",
    model="retab-small",
    subdocuments=subdocuments,
)

for split in result.output:
    print(f"{split.name}: pages {split.pages}")
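Once the batch is split, each label can be sent to its own downstream step. The sketch below shows one possible dispatch pattern on plain dicts. The handler functions and `route_splits` are hypothetical placeholders standing in for your own extraction or review code.

```python
def handle_invoice(pages):
    # Placeholder for an invoice extraction step.
    return f"extract invoice fields from pages {pages}"


def handle_contract(pages):
    # Placeholder for a legal review step.
    return f"send pages {pages} to legal review"


def route_splits(results, handlers, default=None):
    """Call the handler registered for each result's name, in output order."""
    outcomes = []
    for result in results:
        handler = handlers.get(result["name"], default)
        if handler is not None:  # Labels without a handler are skipped.
            outcomes.append(handler(result["pages"]))
    return outcomes


# Illustrative data, not real API output.
batch = [
    {"name": "invoice", "pages": [1, 2]},
    {"name": "cover_letter", "pages": [3]},
    {"name": "contract", "pages": [4, 5]},
]
handlers = {"invoice": handle_invoice, "contract": handle_contract}
for outcome in route_splits(batch, handlers):
    print(outcome)
```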

Understanding Repeated Instances

The same subdocument can appear more than once in a response. This is useful for interleaved packets or repeated document types inside one file.

from retab import Retab

client = Retab()

result = client.splits.create(
    document="mixed_batch.pdf",
    model="retab-small",
    subdocuments=[
        {
            "name": "invoice",
            "description": "Invoice documents",
            "allow_multiple_instances": True,
        },
        {
            "name": "receipt",
            "description": "Receipt documents",
            "allow_multiple_instances": True,
        },
    ],
)

for split in result.output:
    print(f"{split.name}: pages {split.pages}")

# Example output:
# invoice: pages [1, 2, 3]
# receipt: pages [4, 5]
# invoice: pages [6, 7, 8]
# receipt: pages [9, 10]
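When a name repeats, it can help to give each occurrence its own identifier before passing it downstream. This sketch numbers instances in the order they appear in the output, on the assumption that this order is meaningful for your document. `number_instances` is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict


def number_instances(results):
    """Attach a per-name running index, e.g. invoice#1, invoice#2."""
    counters = defaultdict(int)
    labelled = []
    for r in results:
        counters[r["name"]] += 1
        labelled.append((f"{r['name']}#{counters[r['name']]}", r["pages"]))
    return labelled


# Same shape as the example output above, written as plain dicts.
repeated = [
    {"name": "invoice", "pages": [1, 2, 3]},
    {"name": "receipt", "pages": [4, 5]},
    {"name": "invoice", "pages": [6, 7, 8]},
    {"name": "receipt", "pages": [9, 10]},
]
for label, pages in number_instances(repeated):
    print(label, pages)
```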

Split vs Partition

Use split when the question is:

“What kind of subdocument is on these pages?”
“Where are the invoices, receipts, and contracts in this file?”

Use partition when the question is:

“Which pages belong to invoice INV-001 versus INV-002?”
“Group this homogeneous packet by claim ID / policy number / invoice number.”

Example:

split: classify a mixed packet into invoice, receipt, contract
partition: group an invoice-only batch into one chunk per invoice_number

If you need both:

Run split first to isolate the relevant subdocument type.
Run partitions.create on that subdocument’s pages.
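The hand-off between the two steps amounts to collecting every page that split assigned to one label. A minimal sketch of that first half follows, using a hypothetical helper `pages_for_label` on plain dicts. How the pages are then passed to partitions.create is not shown on this page, so the second call is deliberately left out.

```python
def pages_for_label(results, label):
    """Return the sorted, de-duplicated 1-indexed pages assigned to `label`."""
    pages = set()
    for r in results:
        if r["name"] == label:
            pages.update(r["pages"])
    return sorted(pages)


# Illustrative data, not real API output.
packet = [
    {"name": "invoice", "pages": [1, 2, 3]},
    {"name": "receipt", "pages": [4, 5]},
    {"name": "invoice", "pages": [6, 7, 8]},
]
invoice_pages = pages_for_label(packet, "invoice")
print(invoice_pages)
```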

Best Practices

Subdocument Definitions

Be specific: Write descriptions that distinguish labels clearly.
Use visual cues: Mention headers, logos, tables, signatures, or layouts.
Avoid overlap: Overlapping labels reduce routing accuracy.
Prefer 3-7 labels: Too many similar labels usually hurts quality.

Model Selection

retab-small: Good default for most split tasks
Raise n_consensus when boundary accuracy matters more than latency

Pipeline Design

Use split before extract when a bundle must be separated first.
Use partition only for key-based grouping, not as a subdocument definition feature.
