Introduction
The split method in Retab’s document processing pipeline analyzes multi-page documents and classifies pages into user-defined categories, returning the page ranges for each section. This endpoint is well suited to processing batches of mixed documents, separating combined PDFs, and organizing document collections by content type.
Common use cases include:
- Document Separation: Split a combined PDF containing multiple invoices, receipts, or contracts into individual sections
- Content Classification: Identify and locate different sections within legal documents, reports, or manuals
- Batch Processing: Process scanned document batches and organize them by document type
- Workflow Automation: Route different document types to appropriate processing pipelines
Key features include:
- Multi-Category Support: Define multiple categories with descriptions for accurate classification
- Discontinuous Sections: The same category can appear multiple times to cover non-contiguous content
- Page-Level Precision: Get exact start and end pages for each section
- Vision-Based Analysis: Uses LLM vision capabilities for accurate page classification
- Flexible Categories: Define custom categories tailored to your document types
Split API
The split method returns a SplitResponse object containing the classified sections with their page ranges.
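To make the shape of such a result concrete, here is a minimal sketch of how classified sections with page ranges could be modeled and read. The `Section` class and its fields (`name`, `start_page`, `end_page`) are assumptions for illustration, not the SDK's confirmed schema, and page ranges are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Section(NamedTuple):
    """Hypothetical stand-in for one classified section in a split result."""
    name: str        # category assigned to this run of pages
    start_page: int  # first page of the section (assumed 1-based, inclusive)
    end_page: int    # last page of the section (assumed inclusive)

    def page_count(self) -> int:
        """Number of pages covered, given inclusive bounds."""
        return self.end_page - self.start_page + 1


def describe(section: Section) -> str:
    """Render one section as a human-readable summary line."""
    return (
        f"{section.name}: pages {section.start_page}-{section.end_page} "
        f"({section.page_count()} total)"
    )


# Illustrative data only; a real result would come from the Split API.
sections = [
    Section("invoice", 1, 3),
    Section("receipt", 4, 4),
    Section("contract", 5, 9),
]

for section in sections:
    print(describe(section))
```

If the real response uses different field names or 0-based pages, only the `Section` definition and `page_count` would need to change.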
Use Case: Processing Mixed Document Batches
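A minimal sketch of routing each classified section of a mixed batch to its own pipeline. The `Section` tuple, its field names, and the handler functions are hypothetical placeholders rather than part of the Retab SDK; in practice the sections would come from the split result.

```python
from typing import Callable, Dict, List, NamedTuple


class Section(NamedTuple):
    """Hypothetical shape of one classified section."""
    name: str
    start_page: int
    end_page: int


def handle_invoice(section: Section) -> str:
    return f"invoice pipeline: pages {section.start_page}-{section.end_page}"


def handle_receipt(section: Section) -> str:
    return f"receipt pipeline: pages {section.start_page}-{section.end_page}"


def handle_contract(section: Section) -> str:
    return f"contract pipeline: pages {section.start_page}-{section.end_page}"


# Dispatch table: category name -> placeholder handler.
HANDLERS: Dict[str, Callable[[Section], str]] = {
    "invoice": handle_invoice,
    "receipt": handle_receipt,
    "contract": handle_contract,
}


def route(sections: List[Section]) -> List[str]:
    """Send each section to its handler; flag categories with no handler."""
    results = []
    for section in sections:
        handler = HANDLERS.get(section.name)
        if handler is None:
            results.append(
                f"unrouted category {section.name!r}: "
                f"pages {section.start_page}-{section.end_page}"
            )
        else:
            results.append(handler(section))
    return results


batch = [Section("invoice", 1, 2), Section("receipt", 3, 3), Section("memo", 4, 5)]
for line in route(batch):
    print(line)
```

Flagging unknown categories explicitly, instead of dropping them, makes it easier to spot a category that was defined for classification but never wired to a pipeline.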
Split a batch of scanned documents into individual invoices, receipts, and contracts for separate processing.
Use Case: Extracting Specific Sections from Reports
Identify and locate specific sections within a large report or manual.
Understanding Discontinuous Sections
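When a category recurs in non-adjacent parts of a document, each contiguous run of pages can be treated as its own section. The sketch below collects such runs per category; the `Section` tuple and its field names are assumptions for illustration, and the sample data is invented.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Section(NamedTuple):
    """Hypothetical shape of one classified section."""
    name: str
    start_page: int
    end_page: int


def group_by_category(sections: List[Section]) -> Dict[str, List[Tuple[int, int]]]:
    """Collect every (start_page, end_page) range under its category name."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for section in sections:
        grouped[section.name].append((section.start_page, section.end_page))
    return dict(grouped)


# Two invoices interleaved with a receipt: "invoice" appears twice.
interleaved = [
    Section("invoice", 1, 2),
    Section("receipt", 3, 3),
    Section("invoice", 4, 6),
]

print(group_by_category(interleaved))
# {'invoice': [(1, 2), (4, 6)], 'receipt': [(3, 3)]}
```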
The Split API correctly handles cases where the same category appears multiple times in a document. This is common when documents are interleaved or when similar content appears in different parts of a document.
Best Practices
Category Definition
- Be Specific: Provide detailed descriptions that distinguish categories clearly
- Use Visual Cues: Mention distinctive visual elements (logos, headers, layouts)
- Include Examples: Reference typical content found in each category
- Avoid Overlap: Ensure categories are mutually exclusive when possible
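As an illustration of these guidelines, the following sketch writes three category definitions as plain dictionaries. The `name` and `description` keys are assumed for the example and may not match the exact request schema; the descriptions themselves are invented samples.

```python
# Hypothetical category definitions: each pairs a short name with a
# description that mentions typical content and visual cues.
categories = [
    {
        "name": "invoice",
        "description": (
            "Billing document with an invoice number, itemized line items, "
            "totals, and payment terms; usually has a vendor logo in the header."
        ),
    },
    {
        "name": "receipt",
        "description": (
            "Proof of a completed payment, typically a narrow single-column "
            "layout with a store name, timestamp, and amount paid."
        ),
    },
    {
        "name": "contract",
        "description": (
            "Multi-page legal agreement with numbered clauses, party names, "
            "and a signature block on the final page."
        ),
    },
]

# Names should be unique so that each page maps to exactly one category.
names = [category["name"] for category in categories]
assert len(names) == len(set(names)), "category names must be unique"
```

Each description names content that the other two lack (payment terms, proof of payment, signature block), which is one way to keep categories from overlapping.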
Model Selection
- gemini-2.5-flash: Best balance of speed and accuracy for most use cases
- gemini-2.5-pro: Higher accuracy for complex or ambiguous documents
- gpt-4.1: Alternative for specific document types
Performance Tips
- Batch Similar Documents: Group similar document types for consistent results
- Limit Categories: Use 3-7 well-defined categories for best accuracy
- Test Descriptions: Iterate on category descriptions to improve classification
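These tips can be turned into a quick pre-flight check run before submitting a request. The sketch below is a hypothetical helper, not part of the Retab SDK; the `name` and `description` keys and the five-word minimum for descriptions are assumptions chosen for illustration.

```python
from typing import Dict, List


def check_categories(
    categories: List[Dict[str, str]],
    min_count: int = 3,
    max_count: int = 7,
    min_description_words: int = 5,  # arbitrary threshold, tune as needed
) -> List[str]:
    """Return warnings for category sets that ignore the tips above."""
    warnings: List[str] = []

    if not min_count <= len(categories) <= max_count:
        warnings.append(
            f"{len(categories)} categories defined; "
            f"{min_count}-{max_count} is recommended"
        )

    seen = set()
    for category in categories:
        name = category.get("name", "").strip()
        description = category.get("description", "").strip()
        if name in seen:
            warnings.append(f"duplicate category name: {name!r}")
        seen.add(name)
        if len(description.split()) < min_description_words:
            warnings.append(f"description for {name!r} may be too short")

    return warnings


draft = [
    {"name": "invoice", "description": "Invoices"},
    {"name": "invoice", "description": "Billing document with line items and totals"},
]
for warning in check_categories(draft):
    print(warning)
```

Running the check while iterating on descriptions gives a cheap signal before spending API calls on a category set that is likely to classify poorly.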