
Introduction

splits.create assigns pages in a multi-page document to named subdocuments. Each result contains only:
  • name
  • pages
That is the full mental model for split. It is a document-labeling primitive, not a key-based grouping primitive.
Use split when one file contains different document types, sections, or repeated subdocument instances and you want to know which pages belong to each one. Common use cases include:
  1. Mixed document batches: Separate invoices, receipts, contracts, and cover letters from one uploaded PDF.
  2. Report section detection: Find the executive summary, appendix, or financial section inside a long report.
  3. Repeated instances: Detect repeated occurrences of the same subdocument type with allow_multiple_instances=True.
  4. Workflow routing: Route each detected subdocument type into its own downstream extraction or review branch.
Key features of the Split API:
  • Named subdocuments: Define the labels you care about with natural-language descriptions.
  • Page-level output: Results are explicit 1-indexed page arrays.
  • Repeated instances: The same name can appear multiple times in output.
  • Consensus support: Increase n_consensus to get consensus.likelihoods and consensus.choices.
  • Pure split primitive: Key-based grouping is handled by partitions.create, not by split.
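Because each result is only a name plus a 1-indexed page list, basic sanity checks can run on the output without further API calls. The sketch below is not part of the Retab SDK: `to_zero_based` and `unassigned_pages` are hypothetical helpers, and the total page count is assumed to be known from elsewhere (for example, from a PDF library).

```python
from typing import Iterable


def to_zero_based(pages: Iterable[int]) -> list[int]:
    """Convert 1-indexed page numbers to 0-indexed offsets."""
    return [p - 1 for p in pages]


def unassigned_pages(page_lists: Iterable[Iterable[int]], total_pages: int) -> list[int]:
    """Return the 1-indexed pages that no subdocument claimed."""
    claimed = {p for pages in page_lists for p in pages}
    return [p for p in range(1, total_pages + 1) if p not in claimed]


# Hypothetical page lists, shaped like split.pages in the examples.
print(to_zero_based([1, 2, 3]))                              # [0, 1, 2]
print(unassigned_pages([[1, 2, 3], [5, 6]], total_pages=7))  # [4, 7]
```

The zero-based conversion is mainly useful when handing pages to libraries that index from 0; the coverage check helps spot pages that matched none of the defined labels.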

Split API

SplitRequest
Returns
Split
A persisted split resource containing the page assignments.
Define the subdocument labels you want to detect, then pass them directly into splits.create.
from retab import Retab

client = Retab()

result = client.splits.create(
    document="property_portfolio.pdf",
    model="retab-small",
    subdocuments=[
        {
            "name": "property_listing",
            "description": "Property listing pages with photos, pricing, and listing details",
            "allow_multiple_instances": True,
        },
        {
            "name": "legal_notice",
            "description": "Legal notices, disclaimers, or policy pages",
        },
    ],
    n_consensus=3,
)

for split in result.output:
    print(split.name, split.pages)

print(result.consensus.likelihoods if result.consensus else None)

Use Case: Processing Mixed Document Batches

Split a batch of scanned documents into invoices, receipts, and contracts for separate downstream handling.
from retab import Retab

client = Retab()

subdocuments = [
    {"name": "invoice", "description": "Invoice documents with billing details, line items, totals, and payment terms"},
    {"name": "receipt", "description": "Payment receipts showing transaction confirmation and amounts paid"},
    {"name": "contract", "description": "Legal contracts with terms, conditions, and signature blocks"},
    {"name": "cover_letter", "description": "Cover letters or transmittal documents"},
]

result = client.splits.create(
    document="scanned_batch.pdf",
    model="retab-small",
    subdocuments=subdocuments,
)

for split in result.output:
    print(f"{split.name}: pages {split.pages}")
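Once the batch is split, separate downstream handling can be as simple as a dispatch table keyed by subdocument name. The following is a minimal sketch under stated assumptions: `SplitItem` is a stand-in for one entry of `result.output` (only `name` and `pages` are assumed), and the handlers are hypothetical placeholders for real extraction or review code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SplitItem:
    """Stand-in for one entry of result.output (assumed: name and pages only)."""
    name: str
    pages: list[int]


def route(items: list[SplitItem], handlers: dict[str, Callable[[list[int]], str]]) -> list[str]:
    """Send each detected subdocument to the handler registered for its name."""
    outcomes = []
    for item in items:
        handler = handlers.get(item.name)
        if handler is None:
            # No branch registered for this label: record it instead of failing.
            outcomes.append(f"unhandled {item.name}: pages {item.pages}")
        else:
            outcomes.append(handler(item.pages))
    return outcomes


# Hypothetical handlers; real ones would call extraction or review code.
handlers = {
    "invoice": lambda pages: f"extract invoice from pages {pages}",
    "contract": lambda pages: f"send contract pages {pages} to legal review",
}

items = [SplitItem("invoice", [1, 2]), SplitItem("cover_letter", [3]), SplitItem("contract", [4, 5, 6])]
for line in route(items, handlers):
    print(line)
```

Recording unhandled labels rather than raising keeps one unexpected subdocument type from blocking the rest of the batch.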

Understanding Repeated Instances

The same subdocument name can appear more than once in a response when allow_multiple_instances is set to True. This is useful for interleaved packets or repeated document types inside one file.
from retab import Retab

client = Retab()

result = client.splits.create(
    document="mixed_batch.pdf",
    model="retab-small",
    subdocuments=[
        {
            "name": "invoice",
            "description": "Invoice documents",
            "allow_multiple_instances": True,
        },
        {
            "name": "receipt",
            "description": "Receipt documents",
            "allow_multiple_instances": True,
        },
    ],
)

for split in result.output:
    print(f"{split.name}: pages {split.pages}")

# Example output:
# invoice: pages [1, 2, 3]
# receipt: pages [4, 5]
# invoice: pages [6, 7, 8]
# receipt: pages [9, 10]
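When the same name appears several times, it is often convenient to collect all instances under that name before further processing. This is a small sketch, not SDK functionality: `group_instances` is a hypothetical helper that works on plain `(name, pages)` pairs, which with real results could be built from `split.name` and `split.pages`.

```python
from collections import defaultdict


def group_instances(output: list[tuple[str, list[int]]]) -> dict[str, list[list[int]]]:
    """Collect the page list of every instance under its subdocument name."""
    grouped = defaultdict(list)
    for name, pages in output:
        grouped[name].append(pages)
    return dict(grouped)


# (name, pages) pairs copied from the example output above.
example = [
    ("invoice", [1, 2, 3]),
    ("receipt", [4, 5]),
    ("invoice", [6, 7, 8]),
    ("receipt", [9, 10]),
]

for name, instances in group_instances(example).items():
    print(f"{name}: {len(instances)} instances, pages {instances}")
```

Each instance keeps its own page list, so the boundaries between the first and second invoice are preserved rather than merged.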

Split vs Partition

Use split when the question is:
  • “What kind of subdocument is on these pages?”
  • “Where are the invoices, receipts, and contracts in this file?”
Use partition when the question is:
  • “Which pages belong to invoice INV-001 versus INV-002?”
  • “Group this homogeneous packet by claim ID / policy number / invoice number.”
Example:
  • split: classify a mixed packet into invoice, receipt, contract
  • partition: group an invoice-only batch into one chunk per invoice_number
If you need both:
  1. Run split first to isolate the relevant subdocument type.
  2. Run partitions.create on that subdocument’s pages.
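The two-step flow above can be sketched without committing to either call's exact signature by injecting both steps as callables. Everything here is illustrative: `split_then_partition`, `fake_split`, and `fake_partition` are hypothetical, and the shape of the partition result (a mapping from key to pages) is an assumption, not the documented return type of partitions.create.

```python
from typing import Callable

SplitOutput = list[tuple[str, list[int]]]


def split_then_partition(
    run_split: Callable[[], SplitOutput],
    run_partition: Callable[[list[int]], dict[str, list[int]]],
    label: str,
) -> list[dict[str, list[int]]]:
    """Run split, keep instances of `label`, then partition each instance's pages."""
    results = []
    for name, pages in run_split():
        if name == label:
            results.append(run_partition(pages))
    return results


# Fakes standing in for the two API calls (hypothetical shapes).
def fake_split() -> SplitOutput:
    return [("cover_letter", [1]), ("invoice", [2, 3, 4, 5])]


def fake_partition(pages: list[int]) -> dict[str, list[int]]:
    # Pretend every two pages form one invoice, keyed by a made-up invoice number.
    return {f"INV-{i // 2 + 1:03d}": pages[i:i + 2] for i in range(0, len(pages), 2)}


print(split_then_partition(fake_split, fake_partition, label="invoice"))
```

In a real pipeline the two fakes would be replaced by thin wrappers around splits.create and partitions.create; keeping them as parameters also makes the orchestration easy to test offline.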

Best Practices

Subdocument Definitions

  • Be specific: Write descriptions that distinguish labels clearly.
  • Use visual cues: Mention headers, logos, tables, signatures, or layouts.
  • Avoid overlap: Overlapping labels reduce routing accuracy.
  • Prefer 3-7 labels: Too many similar labels usually hurt quality.
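Some of these guidelines can be checked mechanically before sending a request. The helper below is a rough sketch, not an SDK feature: `lint_subdocuments` is hypothetical, and the four-word minimum for descriptions is an arbitrary threshold chosen for illustration.

```python
def lint_subdocuments(subdocuments: list[dict]) -> list[str]:
    """Return human-readable warnings for a list of subdocument definitions."""
    warnings = []
    names = [s.get("name", "") for s in subdocuments]
    for name in sorted({n for n in names if names.count(n) > 1}):
        warnings.append(f"duplicate name: {name!r}")
    for s in subdocuments:
        description = s.get("description", "").strip()
        if len(description.split()) < 4:  # arbitrary threshold for "too vague"
            warnings.append(f"description for {s.get('name')!r} may be too short to be distinctive")
    if not 3 <= len(subdocuments) <= 7:
        warnings.append(f"{len(subdocuments)} labels defined; 3-7 is the suggested range")
    return warnings


subdocuments = [
    {"name": "invoice", "description": "Invoice documents"},
    {"name": "receipt", "description": "Payment receipts showing transaction confirmation and amounts paid"},
]
for warning in lint_subdocuments(subdocuments):
    print(warning)
```

A check like this cannot detect semantic overlap between labels; it only catches the cheap problems (duplicates, very short descriptions, label count) so that review time goes to the descriptions themselves.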

Model Selection

  • retab-small: Good default for most split tasks.
  • n_consensus: Raise it when boundary accuracy matters more than latency.

Pipeline Design

  • Use split before extract when a bundle must be separated first.
  • Use partition only for key-based grouping, not as a subdocument definition feature.