Introduction

The classify method in Retab’s document processing pipeline analyzes a document and assigns it to exactly one of your user-defined categories. Unlike the split method, which identifies multiple sections within a document, classify determines the single most appropriate category for the entire document and returns it along with chain-of-thought reasoning that explains the decision.
Common use cases include:
  1. Document Routing: Automatically route incoming documents to the appropriate processing pipeline based on their type
  2. Pre-filtering: Classify documents before extraction to apply the correct schema
  3. Mailroom Automation: Sort incoming mail and attachments by document type
  4. Quality Control: Verify document types match expected categories in a workflow
Key features of the Classify API:
  • Single Classification: Returns exactly one category for the entire document
  • Chain-of-Thought Reasoning: Provides detailed explanation for the classification decision
  • Vision-Based Analysis: Uses LLM vision capabilities for accurate document understanding
  • Flexible Categories: Define custom categories tailored to your document types
  • Explainable Results: Understand why a document was classified a certain way

Classify API

ClassifyRequest
Returns
ClassifyResponse Object
A ClassifyResponse object containing the classification result with reasoning.
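
The examples on this page read the result through `result.result.classification` and `result.result.reasoning`. As a rough illustration of that shape, here is a minimal sketch using stand-in dataclasses. The class names `ClassifyResultSketch` and `ClassifyResponseSketch` and the `summarize` helper are hypothetical and are not the SDK's actual types; the real ClassifyResponse may carry additional fields.

```python
from dataclasses import dataclass


@dataclass
class ClassifyResultSketch:
    """Stand-in for the inner result object (hypothetical, for illustration only)."""
    classification: str  # name of the single chosen category
    reasoning: str       # explanation of why that category was chosen


@dataclass
class ClassifyResponseSketch:
    """Stand-in for ClassifyResponse; models only the fields used on this page."""
    result: ClassifyResultSketch


def summarize(response) -> str:
    """Build a one-line summary from any object exposing result.classification/reasoning."""
    r = response.result
    return f"{r.classification}: {r.reasoning}"


example = ClassifyResponseSketch(
    result=ClassifyResultSketch(
        classification="invoice",
        reasoning="Contains line items, totals, and payment terms.",
    )
)
summary = summarize(example)
print(summary)
```

Because `summarize` only relies on attribute access, it should work the same way on a real response object, assuming it exposes the two fields shown in the examples.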

Use Case: Document Routing in a Processing Pipeline

Classify incoming documents to route them to the appropriate extraction schema.
from retab import Retab

client = Retab()

# Define document type categories
categories = [
    {"name": "invoice", "description": "Invoice documents with billing details, line items, totals, and payment terms"},
    {"name": "receipt", "description": "Payment receipts showing transaction confirmation and amounts paid"},
    {"name": "contract", "description": "Legal contracts with terms, conditions, and signature blocks"},
    {"name": "purchase_order", "description": "Purchase order documents with order details and shipping information"},
]

# Classify the document
result = client.documents.classify(
    document="incoming_document.pdf",
    model="retab-small",
    categories=categories
)

print(f"Document classified as: {result.result.classification}")
print(f"Reasoning: {result.result.reasoning}")

# Route to appropriate extraction pipeline
if result.result.classification == "invoice":
    # Use invoice extraction schema
    invoice_schema = {...}
    extraction = client.documents.extract(
        document="incoming_document.pdf",
        model="retab-small",
        json_schema=invoice_schema
    )
elif result.result.classification == "contract":
    # Use contract extraction schema
    contract_schema = {...}
    extraction = client.documents.extract(
        document="incoming_document.pdf",
        model="retab-small",
        json_schema=contract_schema
    )
else:
    # No extraction pipeline is defined here for the remaining categories
    extraction = None
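
As the number of categories grows, an if/elif chain becomes harder to maintain. One possible alternative is a lookup table from category name to extraction schema. The sketch below assumes the same `client.documents.extract` call used above; `SCHEMAS_BY_CATEGORY`, `schema_for`, and `route` are hypothetical names, and the schemas are placeholders rather than real extraction schemas.

```python
from typing import Any, Dict, Optional

# Hypothetical placeholder schemas; substitute your real JSON schemas.
SCHEMAS_BY_CATEGORY: Dict[str, Dict[str, Any]] = {
    "invoice": {"title": "Invoice", "type": "object"},
    "contract": {"title": "Contract", "type": "object"},
}


def schema_for(classification: str) -> Optional[Dict[str, Any]]:
    """Return the extraction schema for a category, or None if none is registered."""
    return SCHEMAS_BY_CATEGORY.get(classification)


def route(client: Any, document: str, classification: str, model: str = "retab-small"):
    """Run extraction with the schema matching the classification.

    `client` is expected to be a Retab client. Returns None when the category
    has no registered schema (e.g. "receipt" or "purchase_order").
    """
    schema = schema_for(classification)
    if schema is None:
        return None
    return client.documents.extract(document=document, model=model, json_schema=schema)
```

Usage would look like `extraction = route(client, "incoming_document.pdf", result.result.classification)`. Adding a new document type then only requires a new entry in the table.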

Use Case: Email Attachment Filtering

Classify email attachments to filter out irrelevant documents.
from retab import Retab

client = Retab()

# Define categories for email attachments
categories = [
    {"name": "invoice", "description": "Invoice or billing documents requiring payment"},
    {"name": "quote", "description": "Price quotes or proposals from vendors"},
    {"name": "marketing", "description": "Marketing materials, brochures, or promotional content"},
    {"name": "other", "description": "Miscellaneous documents not fitting other categories"},
]

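# Process each attachment
# (email_attachments and the handler functions below are assumed to be defined elsewhere in your application)
for attachment in email_attachments: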
    result = client.documents.classify(
        document=attachment,
        model="retab-small",
        categories=categories
    )
    
    if result.result.classification == "invoice":
        # High priority - process immediately
        process_invoice(attachment)
    elif result.result.classification == "quote":
        # Medium priority - queue for review
        queue_for_review(attachment)
    elif result.result.classification == "marketing":
        # Low priority - skip or archive
        archive_document(attachment)
    
    print(f"{attachment.name}: {result.result.classification}")
    print(f"  Reason: {result.result.reasoning[:100]}...")

Classify vs Split: When to Use Each

Feature  | Classify                           | Split
Purpose  | Categorize entire document         | Identify sections within document
Output   | Single category                    | Multiple page ranges
Use Case | Document routing, filtering        | Batch separation, section extraction
Input    | Any document                       | Typically multi-page documents
Result   | One classification with reasoning  | List of sections with page ranges
Use Classify when:
  • You need to determine what type of document you have
  • Documents are single-purpose (one invoice, one contract, etc.)
  • You are building a document routing or triage system
  • You want to pre-filter documents before extraction
Use Split when:
  • Documents contain multiple sections of different types
  • You are processing batched or combined PDFs
  • You need to locate specific sections within a document
  • You are extracting page ranges for further processing

Best Practices

Category Definition

  • Be Specific: Provide detailed descriptions that distinguish categories clearly
  • Use Visual Cues: Mention distinctive visual elements (logos, headers, layouts)
  • Include Examples: Reference typical content found in each category
  • Add Catch-all: Consider an “other” category for documents that don’t fit
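
A quick sanity check on the category list before sending it can catch some of these issues early. The sketch below is one possible approach, not part of the Retab SDK: `validate_categories` is a hypothetical helper, and the five-word minimum for descriptions is an arbitrary threshold chosen for illustration.

```python
def validate_categories(categories, min_description_words=5):
    """Return a list of problems found in a category list (empty list means OK).

    The minimum description length is an assumed heuristic, not an API rule.
    """
    problems = []
    seen = set()
    for cat in categories:
        name = cat.get("name", "").strip()
        description = cat.get("description", "").strip()
        if not name:
            problems.append("category with empty name")
            continue
        if name in seen:
            problems.append(f"duplicate category name: {name}")
        seen.add(name)
        if len(description.split()) < min_description_words:
            problems.append(f"description too short for: {name}")
    return problems


categories = [
    {"name": "invoice", "description": "Invoice or billing documents requiring payment"},
    {"name": "quote", "description": "Price quotes or proposals from vendors"},
    {"name": "other", "description": "Miscellaneous documents not fitting other categories"},
]
problems = validate_categories(categories)
print(problems or "categories look OK")
```

A check like this only covers structure. Whether the descriptions actually separate the categories well still has to be tested against real documents.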

Leveraging Reasoning

  • Quality Assurance: Use reasoning to validate classification decisions
  • Audit Trail: Store reasoning for compliance and debugging
  • Confidence Assessment: Longer, more detailed reasoning often indicates higher confidence, but treat this as a rough heuristic rather than a calibrated score
  • Human Review: Flag documents with ambiguous reasoning for manual review
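
The audit-trail and human-review ideas above can be sketched as two small helpers. Everything here is illustrative: `needs_human_review`, `audit_record`, and the `AMBIGUITY_MARKERS` phrase list are hypothetical, and keyword matching is only a crude stand-in for judging whether reasoning is ambiguous.

```python
import json
from datetime import datetime, timezone

# Assumed phrases that hint at uncertainty; tune these for your own documents.
AMBIGUITY_MARKERS = ("unclear", "ambiguous", "could be", "not sure")


def needs_human_review(reasoning: str, markers=AMBIGUITY_MARKERS) -> bool:
    """Flag reasoning text that contains any of the uncertainty phrases."""
    text = reasoning.lower()
    return any(marker in text for marker in markers)


def audit_record(document_name: str, classification: str, reasoning: str) -> str:
    """Serialize one classification decision as a JSON line for an audit log."""
    return json.dumps({
        "document": document_name,
        "classification": classification,
        "reasoning": reasoning,
        "needs_review": needs_human_review(reasoning),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })


line = audit_record(
    "incoming_document.pdf",
    "invoice",
    "The layout is ambiguous; it could be a quote or an invoice.",
)
print(line)
```

Each returned string can be appended to a log file, one record per line, so that decisions remain searchable later. The inputs would come from `result.result.classification` and `result.result.reasoning`.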

Model Selection

  • retab-small: Best balance of speed and accuracy for most use cases
  • retab-large: Higher accuracy for complex or ambiguous documents
  • retab-micro: Fastest option for high-volume, straightforward classification

Performance Tips

  • Limit Categories: Use 3-7 well-defined categories for best accuracy
  • Test Descriptions: Iterate on category descriptions to improve classification
  • Parallel Processing: Classify multiple documents concurrently for higher throughput
  • Use first_n_pages: For large documents whose type is evident from the early pages, limit processing to the first N pages to reduce latency and cost
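
The parallel-processing and first_n_pages tips can be combined along the following lines. This is a sketch under stated assumptions: `make_classifier` and `classify_many` are hypothetical helpers, `first_n_pages` is assumed to be accepted as a keyword argument of `client.documents.classify` (check the API reference for the exact form), and the client is assumed to be safe to share across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def make_classifier(client, categories, model="retab-small", first_n_pages=None):
    """Build a one-argument classify function around a Retab client.

    Assumption: first_n_pages is passed as a keyword argument to
    client.documents.classify; it is omitted entirely when None.
    """
    def classify_one(document):
        kwargs = {"document": document, "model": model, "categories": categories}
        if first_n_pages is not None:
            kwargs["first_n_pages"] = first_n_pages
        return client.documents.classify(**kwargs)

    return classify_one


def classify_many(classify_one, documents, max_workers=4):
    """Classify documents concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_one, documents))
```

Usage would look like `results = classify_many(make_classifier(client, categories, first_n_pages=2), paths)`. Threads suit this workload because each call mostly waits on the network; keep `max_workers` modest so as not to exceed any rate limits that apply to your account.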