Introduction
splits.create assigns pages in a multi-page document to named subdocuments. Each result contains only:
namepages
split when one file contains different document types, sections, or repeated subdocument instances and you want to know which pages belong to each one.
Common use cases include:
- Mixed document batches: Separate invoices, receipts, contracts, and cover letters from one uploaded PDF.
- Report section detection: Find the executive summary, appendix, or financial section inside a long report.
- Repeated instances: Detect repeated occurrences of the same subdocument type with
allow_multiple_instances=True. - Workflow routing: Route each detected subdocument type into its own downstream extraction or review branch.
- Named subdocuments: Define the labels you care about with natural-language descriptions.
- Page-level output: Results are explicit 1-indexed page arrays.
- Repeated instances: The same
namecan appear multiple times inoutput. - Consensus support: Increase
n_consensusto getconsensus.likelihoodsandconsensus.choices. - Pure split primitive: Key-based grouping is handled by
partitions.create, not bysplit.
Split API
A persisted split resource containing the page assignments.
Recommended Workflow
Define the subdocument labels you want to detect, then pass them directly intosplits.create.
Use Case: Processing Mixed Document Batches
Split a batch of scanned documents into invoices, receipts, and contracts for separate downstream handling.Understanding Repeated Instances
The same subdocument can appear more than once in a response. This is useful for interleaved packets or repeated document types inside one file.Split vs Partition
Usesplit when the question is:
- “What kind of subdocument is on these pages?”
- “Where are the invoices, receipts, and contracts in this file?”
partition when the question is:
- “Which pages belong to invoice INV-001 versus INV-002?”
- “Group this homogeneous packet by claim ID / policy number / invoice number.”
split: classify a mixed packet intoinvoice,receipt,contractpartition: group an invoice-only batch into one chunk perinvoice_number
- Run
splitfirst to isolate the relevant subdocument type. - Run
partitions.createon that subdocument’s pages.
Best Practices
Subdocument Definitions
- Be specific: Write descriptions that distinguish labels clearly.
- Use visual cues: Mention headers, logos, tables, signatures, or layouts.
- Avoid overlap: Overlapping labels reduce routing accuracy.
- Prefer 3-7 labels: Too many similar labels usually hurts quality.
Model Selection
retab-small: Good default for most split tasks- Raise
n_consensuswhen boundary accuracy matters more than latency
Pipeline Design
- Use
splitbeforeextractwhen a bundle must be separated first. - Use
partitiononly for key-based grouping, not as a subdocument definition feature.