Classify

Introduction

The classify method in Retab’s document processing pipeline analyzes a document and classifies it into exactly one of the user-defined categories.
Unlike multi-section routing workflows, classify determines the single most appropriate category for the entire document and returns the classification with reasoning. Common use cases include:

Document Routing: Automatically route incoming documents to the appropriate processing pipeline based on their type
Pre-filtering: Classify documents before extraction to apply the correct schema
Mailroom Automation: Sort incoming mail and attachments by document type
Quality Control: Verify document types match expected categories in a workflow

Key features of the Classify API:

Single Classification: Returns exactly one category for the entire document
Chain-of-Thought Reasoning: Provides a detailed explanation for the classification decision
Vision-Based Analysis: Uses LLM vision capabilities for accurate document understanding
Flexible Categories: Define custom categories tailored to your document types
Explainable Results: Understand why a document was classified a certain way

Classify API

ClassifyRequest


document

MIMEData

required

The document to classify. Can be a file path, bytes, or PIL.Image.Image object.

model

LLMModel

required

The AI model to use for document classification. Recommended: retab-small for the best balance of speed and accuracy.

Use Case: Document Routing in a Processing Pipeline

Classify incoming documents to route them to the appropriate extraction schema.

from retab import Retab

client = Retab()

# Define document type categories
categories = [
    {"name": "invoice", "description": "Invoice documents with billing details, line items, totals, and payment terms"},
    {"name": "receipt", "description": "Payment receipts showing transaction confirmation and amounts paid"},
    {"name": "contract", "description": "Legal contracts with terms, conditions, and signature blocks"},
    {"name": "purchase_order", "description": "Purchase order documents with order details and shipping information"},
]

# Classify the document
result = client.documents.classify(
    document="incoming_document.pdf",
    model="retab-small",
    categories=categories
)

print(f"Document classified as: {result.result.classification}")
print(f"Reasoning: {result.result.reasoning}")

# Route to the appropriate extraction pipeline
extraction = None  # stays None when no schema is configured for the category
if result.result.classification == "invoice":
    # Use invoice extraction schema
    invoice_schema = {...}
    extraction = client.documents.extract(
        document="incoming_document.pdf",
        model="retab-small",
        json_schema=invoice_schema
    )
elif result.result.classification == "contract":
    # Use contract extraction schema
    contract_schema = {...}
    extraction = client.documents.extract(
        document="incoming_document.pdf",
        model="retab-small",
        json_schema=contract_schema
    )
else:
    # receipt and purchase_order have no extraction schema in this example
    print("No extraction schema configured for this category")
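
As the number of categories grows, the if/elif chain above can be replaced by a lookup table. The following is a minimal sketch of that idea, not part of the Retab SDK: SCHEMAS_BY_CATEGORY, select_schema, and route_document are hypothetical names, the schemas are placeholders, and the extract argument stands in for whatever function performs the extraction call.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical placeholder schemas; real JSON schemas would go here.
SCHEMAS_BY_CATEGORY: Dict[str, Dict[str, Any]] = {
    "invoice": {"type": "object", "title": "Invoice"},
    "contract": {"type": "object", "title": "Contract"},
}


def select_schema(classification: str) -> Optional[Dict[str, Any]]:
    """Return the schema registered for a category, or None if there is none."""
    return SCHEMAS_BY_CATEGORY.get(classification)


def route_document(
    document: Any,
    classification: str,
    extract: Callable[[Any, Dict[str, Any]], Any],
) -> Any:
    """Call extract(document, schema) when a schema exists; otherwise return None."""
    schema = select_schema(classification)
    if schema is None:
        # Categories without a schema (e.g. receipt) are skipped.
        return None
    return extract(document, schema)
```

With this shape, supporting a new document type means adding one entry to the table rather than another branch.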

Use Case: Email Attachment Filtering

Classify email attachments to filter out irrelevant documents.

from retab import Retab

client = Retab()

# Define categories for email attachments
categories = [
    {"name": "invoice", "description": "Invoice or billing documents requiring payment"},
    {"name": "quote", "description": "Price quotes or proposals from vendors"},
    {"name": "marketing", "description": "Marketing materials, brochures, or promotional content"},
    {"name": "other", "description": "Miscellaneous documents not fitting other categories"},
]

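# Process each attachment.
# email_attachments and the process_invoice, queue_for_review, and
# archive_document helpers are application-defined placeholders.
for attachment in email_attachments: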
    result = client.documents.classify(
        document=attachment,
        model="retab-small",
        categories=categories
    )
    
    if result.result.classification == "invoice":
        # High priority - process immediately
        process_invoice(attachment)
    elif result.result.classification == "quote":
        # Medium priority - queue for review
        queue_for_review(attachment)
    elif result.result.classification == "marketing":
        # Low priority - skip or archive
        archive_document(attachment)
    
    print(f"{attachment.name}: {result.result.classification}")
    print(f"  Reason: {result.result.reasoning[:100]}...")
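
The loop above leaves the "other" category, and any unexpected label, without an explicit action. One way to make that fallback explicit is sketched below. It assumes classifications have already been collected as (name, classification) pairs; ACTION_BY_CATEGORY, DEFAULT_ACTION, and triage are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from category name to a follow-up action.
ACTION_BY_CATEGORY = {
    "invoice": "process_now",
    "quote": "review",
    "marketing": "archive",
}
# Applied to "other" and to any label that is not in the mapping.
DEFAULT_ACTION = "ignore"


def triage(classified: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group attachment names by the action assigned to their classification."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for name, classification in classified:
        action = ACTION_BY_CATEGORY.get(classification, DEFAULT_ACTION)
        queues[action].append(name)
    return dict(queues)
```

Each queue can then be handed to the corresponding handler in one pass.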

Classify vs Split: When to Use Each

Feature	Classify	Split
Purpose	Categorize entire document	Identify sections within document
Output	Single category	Multiple page ranges
Use Case	Document routing, filtering	Batch separation, section extraction
Input	Any document	Typically multi-page documents
Result	One classification with reasoning	List of sections with page ranges

Use Classify when:

You need to determine what type of document you have
Documents are single-purpose (one invoice, one contract, etc.)
You are building a document routing or triage system
You want to pre-filter documents before extraction

Use Split when:

Documents contain multiple sections of different types
You are processing batched or combined PDFs
You need to locate specific sections within a document
You are extracting page ranges for further processing

Best Practices

Category Definition

Be Specific: Provide detailed descriptions that distinguish categories clearly
Use Visual Cues: Mention distinctive visual elements (logos, headers, layouts)
Include Examples: Reference typical content found in each category
Add Catch-all: Consider an “other” category for documents that don’t fit
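
These guidelines can be checked mechanically before any documents are sent for classification. The sketch below is a hypothetical helper, not part of the Retab SDK; it assumes the name/description dictionary shape used in the examples on this page. The five-word minimum and the "other" catch-all name are arbitrary illustrative defaults.

```python
from typing import Dict, List


def validate_categories(
    categories: List[Dict[str, str]],
    min_description_words: int = 5,
    catch_all_name: str = "other",
) -> List[str]:
    """Return a list of advisory messages; an empty list means no issues found."""
    problems: List[str] = []
    seen = set()
    for index, category in enumerate(categories):
        name = (category.get("name") or "").strip()
        description = (category.get("description") or "").strip()
        if not name:
            problems.append(f"category {index}: missing name")
        elif name in seen:
            problems.append(f"category {index}: duplicate name '{name}'")
        seen.add(name)
        # Very short descriptions rarely distinguish one category from another.
        if len(description.split()) < min_description_words:
            problems.append(f"category {index} ('{name}'): description may be too short")
    if catch_all_name not in seen:
        problems.append(f"no catch-all category such as '{catch_all_name}'")
    return problems
```

The messages are advisory: a missing catch-all, for example, may be intentional when every incoming document is known to belong to a listed type.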

Leveraging Reasoning

Quality Assurance: Use reasoning to validate classification decisions
Audit Trail: Store reasoning for compliance and debugging
Confidence Assessment: Longer, more detailed reasoning often indicates higher confidence; treat this as a rough heuristic rather than a calibrated score
Human Review: Flag documents with ambiguous reasoning for manual review
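
The audit-trail and human-review points can be combined into one record per classified document. The following is a rough sketch under stated assumptions: needs_review and audit_record are hypothetical helpers, and the HEDGING_TERMS keyword list is a crude illustrative heuristic for spotting ambiguous reasoning, not something Retab provides.

```python
import json
from datetime import datetime, timezone
from typing import Sequence

# Hypothetical phrases that suggest the reasoning is hedged or ambiguous.
HEDGING_TERMS = ("unclear", "ambiguous", "could be", "not sure", "uncertain")


def needs_review(reasoning: str, hedging_terms: Sequence[str] = HEDGING_TERMS) -> bool:
    """Return True when the reasoning text contains any hedging phrase."""
    text = reasoning.lower()
    return any(term in text for term in hedging_terms)


def audit_record(document_name: str, classification: str, reasoning: str) -> str:
    """Serialize one classification as a JSON line suitable for an append-only log."""
    return json.dumps({
        "document": document_name,
        "classification": classification,
        "reasoning": reasoning,
        "needs_review": needs_review(reasoning),
        "classified_at": datetime.now(timezone.utc).isoformat(),
    })
```

Substring matching will produce false positives and negatives, so the term list should be tuned against real reasoning output before it gates any workflow.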

Model Selection

retab-small: Best balance of speed and accuracy for most use cases
retab-large: Higher accuracy for complex or ambiguous documents
retab-micro: Fastest option for high-volume, straightforward classification

Performance Tips

Limit Categories: Use 3-7 well-defined categories for best accuracy
Test Descriptions: Iterate on category descriptions to improve classification
Parallel Processing: Classify multiple documents concurrently for higher throughput
Use first_n_pages: When a large document can be classified from its opening pages, set first_n_pages to process only the first N pages, which reduces latency and cost
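
The parallel-processing tip can be sketched with Python's standard thread pool. This is an illustrative helper, not a Retab API: classify_many is a hypothetical name, and classify_one is any callable that classifies a single document, for example a small wrapper around client.documents.classify. Passing first_n_pages as a keyword argument inside that wrapper is an assumption about the call signature.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def classify_many(
    documents: Sequence[Any],
    classify_one: Callable[[Any], Any],
    max_workers: int = 4,
) -> List[Any]:
    """Apply classify_one to each document concurrently.

    Results are returned in the same order as the input documents.
    An exception raised by classify_one propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_one, documents))
```

Threads suit this workload because each call spends most of its time waiting on the network. Keep max_workers modest so that requests stay within whatever rate limits apply to your account.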
