Introduction
The split method in Retab’s document processing pipeline analyzes a multi-page document and classifies pages into user-defined subdocuments. Each result returns a name and the exact pages assigned to that subdocument. The same subdocument can appear multiple times in one response, which makes split useful for mixed batches, merged PDFs, and long documents with repeated sections.
The SDK also exposes documents.generate_split_config(...), which can propose the initial subdocument list for you before you run documents.split(...).
Common use cases include:
- Document Separation: Split a combined PDF containing multiple invoices, receipts, or contracts into individual sections
- Content Classification: Identify and locate different sections within legal documents, reports, or manuals
- Batch Processing: Process scanned document batches and organize them by document type
- Workflow Automation: Route different document types to appropriate processing pipelines
Key features include:
- Multi-Subdocument Support: Define multiple subdocuments with descriptions for accurate classification
- Discontinuous Sections: The same subdocument can appear multiple times when its content is non-contiguous
- Page-Level Precision: Get the exact list of 1-indexed pages for each section
- Vision-Based Analysis: Uses LLM vision capabilities for accurate page classification
- Flexible Subdocuments: Define custom subdocuments tailored to your document types
- Partition Detection: Add a partition_key to break one subdocument into repeated items
- Consensus Scoring: Increase n_consensus to get confidence signals like likelihood and votes
- Config Generation: Use generate_split_config to bootstrap subdocuments from a sample file
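Because the same subdocument name can appear more than once in a response, downstream code often needs to merge those occurrences. The following is a minimal sketch, assuming each result has already been converted to a plain dict with name and pages keys. The helper name pages_by_subdocument and the sample data are illustrative and are not part of the SDK.

```python
from collections import defaultdict


def pages_by_subdocument(results):
    """Merge the pages of repeated subdocuments into one sorted list per name."""
    merged = defaultdict(list)
    for item in results:
        merged[item["name"]].extend(item["pages"])
    # Deduplicate and sort so each name maps to a clean 1-indexed page list.
    return {name: sorted(set(pages)) for name, pages in merged.items()}


# Hypothetical split output: "invoice" appears twice, around a contract.
sample_results = [
    {"name": "invoice", "pages": [1, 2]},
    {"name": "contract", "pages": [3, 4, 5]},
    {"name": "invoice", "pages": [6]},
]

merged_pages = pages_by_subdocument(sample_results)
print(merged_pages)
```

Merging is only appropriate when you want all pages of a type together. To keep each occurrence as a separate item, iterate over the results in order instead.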
Split API
Returns a SplitResponse object containing the classified sections with their assigned pages.
Recommended Workflow
If you do not already know the right subdocument list, start with generate_split_config and then pass the generated subdocuments into split.
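The two-step workflow can be sketched as a small helper. Only the method names documents.generate_split_config and documents.split come from this page. The keyword names document and subdocuments, the .subdocuments attribute on the generated config, and the stand-in client are assumptions made for illustration, so check them against the SDK reference before use.

```python
from types import SimpleNamespace


def split_with_generated_config(client, document):
    """Propose subdocuments from the file, then split the same file with them.

    ASSUMPTION: the `document` and `subdocuments` keyword names and the
    `.subdocuments` attribute of the generated config are illustrative guesses.
    """
    config = client.documents.generate_split_config(document=document)
    return client.documents.split(document=document, subdocuments=config.subdocuments)


# Stand-in client (hypothetical) so the sketch runs without credentials.
def _fake_generate_split_config(document):
    return SimpleNamespace(
        subdocuments=[{"name": "invoice", "description": "Billing document with totals"}]
    )


def _fake_split(document, subdocuments):
    # Echo back which subdocument names were requested.
    return SimpleNamespace(requested_names=[s["name"] for s in subdocuments])


fake_client = SimpleNamespace(
    documents=SimpleNamespace(
        generate_split_config=_fake_generate_split_config,
        split=_fake_split,
    )
)

response = split_with_generated_config(fake_client, "mixed_batch.pdf")
print(response.requested_names)
```

In practice you would likely review and edit the generated subdocuments before passing them to split, since the generated list is described as an initial proposal.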
Use Case: Processing Mixed Document Batches
Split a batch of scanned documents into individual invoices, receipts, and contracts for separate processing.
Use Case: Extracting Specific Sections from Reports
Identify and locate specific sections within a large report or manual.
Understanding Discontinuous Sections
The Split API correctly handles cases where the same subdocument appears multiple times in a document. This is common when documents are interleaved or when similar content appears in different parts of a document.
Use Case: Partitioning by Key
When processing documents that contain multiple items of the same type (e.g., multiple invoices, multiple property listings), use the partition_key parameter to identify and separate individual items within a subdocument.
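As a rough illustration of working with partitioned output, the sketch below flattens a result into one row per item. The partition_key parameter and the partitions field are named on this page. The exact shape used here (a key and a pages entry on each partition, plain dicts rather than SDK objects) is hypothetical.

```python
def partition_items(results):
    """Flatten partitions into (subdocument name, key, pages) tuples."""
    items = []
    for item in results:
        # Subdocuments without a partition_key are assumed to have no partitions.
        for part in item.get("partitions", []):
            items.append((item["name"], part["key"], part["pages"]))
    return items


# Hypothetical subdocument definition that asks for one item per invoice number.
invoice_subdocument = {
    "name": "invoice",
    "description": "Billing document with an invoice number and line items",
    "partition_key": "invoice number",
}

# Hypothetical result for a 3-page run holding two separate invoices.
sample_results = [
    {
        "name": "invoice",
        "pages": [1, 2, 3],
        "partitions": [
            {"key": "INV-001", "pages": [1, 2]},
            {"key": "INV-002", "pages": [3]},
        ],
    }
]

items = partition_items(sample_results)
for name, key, pages in items:
    print(name, key, pages)
```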
Sub-Page Precision with Partitions
The Split API provides sub-page level precision through the partitions field. Each partition includes Y-coordinates that specify exactly where content starts and ends within pages, enabling precise extraction even when document sections don’t align with page boundaries.
- 0.0 represents the top of the page
- 1.0 represents the bottom of the page
- first_page_y_start indicates where content begins on the first page of the partition
- last_page_y_end indicates where content ends on the last page of the partition
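The coordinate convention above can be turned into per-page crop ranges. The sketch below assumes that pages between the first and last page of a partition are covered in full, which this page implies but does not state. The helper names and the 1000-pixel page height are illustrative.

```python
def partition_page_spans(pages, first_page_y_start, last_page_y_end):
    """Return (page, y_start, y_end) per page in normalized coordinates.

    0.0 is the top of a page and 1.0 is the bottom. Middle pages are
    assumed to belong to the partition in full.
    """
    spans = []
    last_index = len(pages) - 1
    for index, page in enumerate(pages):
        y_start = first_page_y_start if index == 0 else 0.0
        y_end = last_page_y_end if index == last_index else 1.0
        spans.append((page, y_start, y_end))
    return spans


def to_pixels(span, page_height_px):
    """Convert one normalized span to integer pixel rows for cropping."""
    page, y_start, y_end = span
    return page, round(y_start * page_height_px), round(y_end * page_height_px)


# A partition that starts 40% down page 2 and ends 25% down page 4.
spans = partition_page_spans([2, 3, 4], first_page_y_start=0.4, last_page_y_end=0.25)
pixel_spans = [to_pixels(span, 1000) for span in spans]
print(pixel_spans)
```

For a single-page partition, both coordinates apply to the same page, so the span runs from first_page_y_start to last_page_y_end.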
Best Practices
Subdocument Definition
- Be Specific: Provide detailed descriptions that distinguish subdocuments clearly
- Use Visual Cues: Mention distinctive visual elements (logos, headers, layouts)
- Include Examples: Reference typical content found in each subdocument
- Avoid Overlap: Ensure subdocuments are mutually exclusive when possible
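The guidelines above can be partly checked before sending a request. The following is a minimal sketch, assuming subdocuments are defined as dicts with name and description keys. The check_subdocuments helper and its five-word minimum are arbitrary illustrations, not SDK behavior.

```python
def check_subdocuments(subdocuments, min_description_words=5):
    """Return a list of human-readable problems found in the definitions."""
    problems = []
    seen = set()
    for sub in subdocuments:
        name = sub.get("name", "").strip()
        description = sub.get("description", "").strip()
        if not name:
            problems.append("subdocument with empty name")
        elif name in seen:
            problems.append(f"duplicate name: {name}")
        seen.add(name)
        # A very short description rarely distinguishes one type from another.
        if len(description.split()) < min_description_words:
            problems.append(f"description too short for: {name or '<unnamed>'}")
    return problems


subdocuments = [
    {
        "name": "invoice",
        "description": "Billing document with a vendor logo, an invoice number, "
        "line items, and a total amount due",
    },
    {
        "name": "receipt",
        "description": "Narrow proof-of-payment slip listing purchased items, "
        "payment method, and transaction date",
    },
    {"name": "contract", "description": "Contract"},
]

problems = check_subdocuments(subdocuments)
print(problems)
```

A check like this catches only mechanical issues. Whether two descriptions overlap in meaning still needs a manual read or a test run on sample documents.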
Model Selection
- retab-large: Best balance of speed and accuracy for most use cases
- retab-small: Higher accuracy for complex or ambiguous documents
- retab-micro: Alternative for specific document types
Performance Tips
- Batch Similar Documents: Group similar document types for consistent results
- Limit Subdocuments: Use 3-7 well-defined subdocuments for best accuracy
- Test Descriptions: Iterate on subdocument descriptions to improve classification