Introduction
The classify method in Retab’s document processing pipeline analyzes a document and assigns it to exactly one of the user-defined categories. Unlike the split method, which identifies multiple sections within a document, classify determines the single most appropriate category for the entire document and returns it along with chain-of-thought reasoning that explains the decision.
Common use cases include:
- Document Routing: Automatically route incoming documents to the appropriate processing pipeline based on their type
- Pre-filtering: Classify documents before extraction to apply the correct schema
- Mailroom Automation: Sort incoming mail and attachments by document type
- Quality Control: Verify document types match expected categories in a workflow
Key features include:
- Single Classification: Returns exactly one category for the entire document
- Chain-of-Thought Reasoning: Provides detailed explanation for the classification decision
- Vision-Based Analysis: Uses LLM vision capabilities for accurate document understanding
- Flexible Categories: Define custom categories tailored to your document types
- Explainable Results: Understand why a document was classified a certain way
Classify API
Returns a ClassifyResponse object containing the classification result and the reasoning behind it.
Use Case: Document Routing in a Processing Pipeline
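A minimal sketch of this routing pattern, assuming a hypothetical `classify` callable that returns one category plus reasoning. The `Classification` dataclass, the `SCHEMAS` mapping, and the stub classifier are illustrative names and are not taken from Retab's SDK:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Classification:
    # Hypothetical stand-in for a classification result: one category plus reasoning.
    category: str
    reasoning: str


# Hypothetical mapping from category name to the extraction schema to apply.
SCHEMAS = {
    "invoice": "schemas/invoice.json",
    "contract": "schemas/contract.json",
}


def route(document: str, classify: Callable[[str], Classification]) -> Optional[str]:
    """Return the schema path for a document, or None if no pipeline handles its category."""
    result = classify(document)
    return SCHEMAS.get(result.category)


def fake_classify(document: str) -> Classification:
    # Stub so the sketch runs offline; swap in a real classify call here.
    category = "invoice" if "invoice" in document.lower() else "other"
    return Classification(category, f"Decided from the file name {document!r}")


print(route("Invoice_0042.pdf", fake_classify))
print(route("holiday_photo.png", fake_classify))
```

Returning `None` for unmapped categories keeps the "no pipeline" case explicit, so the caller decides whether to skip the document or send it to manual handling.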
Classify incoming documents to route them to the appropriate extraction schema.
Use Case: Email Attachment Filtering
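A minimal sketch of attachment filtering, assuming a hypothetical `classify_category` callable that returns one category name per attachment. The category names in `RELEVANT` and the stub classifier are illustrative only:

```python
from typing import Callable, Iterable, List

# Hypothetical set of categories worth passing on to extraction.
RELEVANT = {"invoice", "purchase_order"}


def filter_attachments(
    attachments: Iterable[str],
    classify_category: Callable[[str], str],
) -> List[str]:
    """Keep only the attachments whose category is in RELEVANT."""
    return [name for name in attachments if classify_category(name) in RELEVANT]


def fake_category(name: str) -> str:
    # Stub so the sketch runs offline; a real implementation would call classify.
    lowered = name.lower()
    if "invoice" in lowered:
        return "invoice"
    if lowered.startswith("po_"):
        return "purchase_order"
    return "other"


kept = filter_attachments(
    ["invoice_march.pdf", "signature_logo.png", "PO_7731.pdf"],
    fake_category,
)
print(kept)
```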
Classify email attachments to filter out irrelevant documents.
Classify vs Split: When to Use Each
| Feature | Classify | Split |
|---|---|---|
| Purpose | Categorize entire document | Identify sections within document |
| Output | Single category | Multiple page ranges |
| Use Case | Document routing, filtering | Batch separation, section extraction |
| Input | Any document | Typically multi-page documents |
| Result | One classification with reasoning | List of sections with page ranges |
Use Classify when:
- You need to determine what type of document you have
- Documents are single-purpose (one invoice, one contract, etc.)
- Building a document routing or triage system
- Pre-filtering before extraction
Use Split when:
- Documents contain multiple sections of different types
- Processing batched/combined PDFs
- Need to locate specific sections within a document
- Extracting page ranges for further processing
Best Practices
Category Definition
- Be Specific: Provide detailed descriptions that distinguish categories clearly
- Use Visual Cues: Mention distinctive visual elements (logos, headers, layouts)
- Include Examples: Reference typical content found in each category
- Add Catch-all: Consider an “other” category for documents that don’t fit
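The guidelines above can be illustrated with a small category list and a sanity check. This is a sketch under assumptions: the `name`/`description` dictionary shape, the example categories, and the five-word minimum are all hypothetical choices made for illustration, not a documented format or threshold.

```python
from typing import Dict, List

# Hypothetical category definitions: specific wording, visual cues, typical content.
CATEGORIES: List[Dict[str, str]] = [
    {
        "name": "invoice",
        "description": "Billing document with an invoice number, line items, "
        "totals, and payment terms; often has a vendor logo in the header.",
    },
    {
        "name": "contract",
        "description": "Legal agreement with numbered clauses, named parties, "
        "and signature blocks on the final page.",
    },
    {
        "name": "other",
        "description": "Any document that does not match the categories above.",
    },
]


def validate_categories(categories: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    names = [c["name"] for c in categories]
    if len(names) != len(set(names)):
        problems.append("duplicate category names")
    if "other" not in names:
        problems.append("no catch-all 'other' category")
    for c in categories:
        # Arbitrary illustrative threshold: flag descriptions under five words.
        if len(c["description"].split()) < 5:
            problems.append(f"description for {c['name']!r} is very short")
    return problems


print(validate_categories(CATEGORIES))
```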
Leveraging Reasoning
- Quality Assurance: Use reasoning to validate classification decisions
- Audit Trail: Store reasoning for compliance and debugging
- Confidence Assessment: Longer, more detailed reasoning often indicates higher confidence, but treat it as a rough signal rather than a calibrated score
- Human Review: Flag documents with ambiguous reasoning for manual review
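One way to combine the audit-trail and human-review ideas is sketched below. The hedging-term list is a hypothetical heuristic for spotting ambiguous reasoning, and the record fields are illustrative; neither comes from Retab's API.

```python
import json
from datetime import datetime, timezone

# Hypothetical phrases that suggest the reasoning itself is uncertain.
HEDGING_TERMS = ("unclear", "ambiguous", "could be", "not certain")


def needs_review(reasoning: str) -> bool:
    """Flag reasoning text that contains any hedging term (case-insensitive)."""
    text = reasoning.lower()
    return any(term in text for term in HEDGING_TERMS)


def audit_record(document_id: str, category: str, reasoning: str) -> str:
    """Serialize one classification decision as a JSON line for storage."""
    record = {
        "document_id": document_id,
        "category": category,
        "reasoning": reasoning,
        "needs_review": needs_review(reasoning),
        "classified_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)


print(audit_record("doc-1", "invoice", "Header reads 'INVOICE' with line items and a total."))
print(audit_record("doc-2", "receipt", "The layout is ambiguous; it could be an invoice or a receipt."))
```

Keyword matching is deliberately crude; it is meant as a starting point for a review queue, not as a measure of model confidence.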
Model Selection
- retab-small: Best balance of speed and accuracy for most use cases
- retab-large: Higher accuracy for complex or ambiguous documents
- retab-micro: Fastest option for high-volume, straightforward classification
Performance Tips
- Limit Categories: Use 3-7 well-defined categories for best accuracy
- Test Descriptions: Iterate on category descriptions to improve classification
- Parallel Processing: Classify multiple documents concurrently for higher throughput
- Use first_n_pages: For large documents where classification can be determined from early pages, use first_n_pages to limit processing to the first N pages, reducing latency and cost
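The parallel-processing tip can be sketched with Python's standard `concurrent.futures` module. The `classify_one` callable is a hypothetical wrapper around a single classify request, stubbed here so the example runs offline; threads suit this workload because each call would mostly wait on the network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def classify_many(
    documents: List[str],
    classify_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Classify documents concurrently and return a {document: category} mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input list.
        categories = list(pool.map(classify_one, documents))
    return dict(zip(documents, categories))


def fake_classify_one(document: str) -> str:
    # Stub standing in for one classify request.
    return "invoice" if "invoice" in document.lower() else "other"


results = classify_many(["invoice_1.pdf", "memo.pdf", "Invoice_2.pdf"], fake_classify_one)
print(results)
```

Keep `max_workers` within whatever rate limits apply to your account; the default of 4 here is an arbitrary illustrative value.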