Chunking is an advanced extraction mode that dramatically speeds up processing for documents containing repeating data structures, such as invoices with multiple line items, tables with many rows, or any document with arrays of similar objects.

How It Works

Instead of processing an entire document in a single LLM call, Chunking:
  1. Identifies array structures in your JSON schema (e.g., line_items, transactions, entries)
  2. Extracts unique keys from the document (e.g., product names, invoice numbers, dates) that identify each item
  3. Segments the document by locating where each key appears
  4. Processes items concurrently - each array item is extracted in parallel using dedicated LLM calls
  5. Merges results into the final structured output while preserving order
This approach is particularly effective for multi-page documents where items span across pages, as each segment can be processed independently.
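The five steps above can be sketched in plain Python. This is a simplified model, not the Retab internals: it assumes keys can be located by exact text match, and `extract_item` stands in for the dedicated per-segment LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_and_extract(document_text, keys, extract_item):
    """Sketch of the chunking pipeline: segment the document around each
    key, extract segments in parallel, and merge in document order."""
    # Steps 2-3: locate each key and take the text up to the next key.
    positions = sorted((document_text.index(key), key) for key in keys)
    segments = []
    for i, (start, key) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(document_text)
        segments.append((key, document_text[start:end]))
    # Step 4: process each segment concurrently with a dedicated call.
    with ThreadPoolExecutor() as pool:
        items = list(pool.map(lambda seg: extract_item(*seg), segments))
    # Step 5: pool.map preserves input order, so document order is kept.
    return items
```

Because `pool.map` returns results in submission order, the merge step preserves the order in which items appear in the document even though the calls run in parallel.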

Context Engineering: Why Chunking is More Accurate

Beyond speed, Chunking significantly improves extraction accuracy through Context Engineering — the practice of optimizing what context the LLM sees for each extraction task. When processing a 50-page invoice with 200 line items in a single LLM call, the model must:
  • Hold the entire document in its context window
  • Track hundreds of items simultaneously
  • Maintain attention across thousands of tokens
  • Avoid confusing similar items that appear pages apart
This leads to common failure modes: missed items, values assigned to wrong rows, and degraded accuracy toward the end of long documents. Chunking solves this by providing focused, relevant context for each item:
| Aspect | Standard Extraction | Chunking |
|---|---|---|
| Context per item | Entire document (all pages) | Only the relevant segment |
| Noise level | High (hundreds of unrelated items) | Minimal (just the target item) |
| Attention dilution | Significant on long documents | None (extraction stays focused) |
| Position bias | Later items often less accurate | Equal accuracy for all items |
By cropping the document to show only the region containing each specific item, the LLM can dedicate its full attention and reasoning capacity to extracting that single item correctly. This is the same principle behind RAG (Retrieval-Augmented Generation) — less noise, more signal, better results.
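The cropping idea can be illustrated in a few lines. This is a hypothetical character-offset sketch (Retab's real segmentation works on document regions, not raw string slices), but it shows how much context shrinks per item:

```python
def crop_context(document_text, key, window=200):
    """Return only the neighborhood of the target item's key, instead of
    passing the whole document to the model. Hypothetical helper."""
    start = document_text.index(key)
    return document_text[max(0, start - window): start + len(key) + window]
```

For a document thousands of characters long, each extraction call now sees only a few hundred characters of relevant context.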

When to Use Chunking

Chunking is ideal when:
  • Your schema contains arrays of objects (e.g., line_items: [{sku, description, quantity, price}])
  • Documents have many repeating items (10+ items benefit most)
  • You need faster turnaround on large documents
  • Items can be uniquely identified by a key field

Usage

Enable Chunking by specifying the chunking_keys parameter in your extraction request:
from retab import Retab

client = Retab()

response = client.documents.extract(
    document="invoice.pdf",
    json_schema=my_schema,
    chunking_keys={
        "line_items": "product_name"  # parent_path: child_key_path
    }
)
The chunking_keys parameter is a dictionary mapping:
  • Key: The path to the array in your schema (e.g., "line_items", "transactions", "items.products")
  • Value: The field within each array item that uniquely identifies it (e.g., "product_name", "sku", "id")
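Since keys are schema paths, a small sanity check before sending a request can catch typos. The helper below is not part of the Retab SDK; it is a hypothetical validator that walks a dot-separated path (e.g. `"items.products"`) through a JSON schema and returns the array's per-item schema:

```python
def resolve_array_path(schema, path):
    """Walk a dot-separated chunking_keys path through a JSON schema,
    returning the item schema of the array it points at.
    Hypothetical validation helper, not part of the Retab SDK."""
    node = schema
    for part in path.split("."):
        node = node["properties"][part]
        if node.get("type") == "array" and "items" in node:
            node = node["items"]  # descend into the array's item schema
    return node
```

If the path does not exist in the schema, this raises a `KeyError`, which is exactly the kind of mistake you want to surface before paying for an extraction.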

Example Schema

{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "date": { "type": "string" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product_name": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "total_amount": { "type": "number" }
  }
}
With chunking_keys={"line_items": "product_name"}, Retab will:
  1. Extract all product names from the document
  2. Locate each product’s position in the document
  3. Extract each line item’s details in parallel
  4. Merge results and extract constants (invoice_number, date, total_amount) separately
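Step 4 amounts to combining two independent results: the document-level constants and the parallel per-item extractions. A minimal sketch of that merge (hypothetical helper, with dot-path support for nested arrays like `"items.products"`):

```python
def merge_results(constants, array_path, items):
    """Combine separately extracted document-level fields with the
    per-item results. Hypothetical sketch of the merge step."""
    merged = dict(constants)
    *parents, leaf = array_path.split(".")
    node = merged
    for part in parents:
        node = node.setdefault(part, {})  # create nested containers as needed
    node[leaf] = items
    return merged
```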

Supported Document Types

  • PDF documents
  • Images (JPEG, PNG, etc.)
  • Office documents (DOCX, PPTX, ODT, ODP)
  • Excel spreadsheets (XLSX, XLS)

Performance Benefits

| Document Size | Standard Extraction | Chunking |
|---|---|---|
| 5 line items | ~3s | ~3s |
| 20 line items | ~8s | ~4s |
| 50 line items | ~15s | ~5s |
| 100+ items | ~30s+ | ~6s |
Times are approximate and vary based on document complexity and model used.

Consensus with Chunking

Chunking fully supports the n_consensus parameter. When enabled, each item is extracted multiple times independently, and results are compared to improve accuracy. This is particularly useful for:
  • High-value documents requiring verification
  • Documents with challenging handwriting or scan quality
  • Compliance-critical extractions
response = client.documents.extract(
    document="invoice.pdf",
    json_schema=my_schema,
    chunking_keys={"line_items": "sku"},
    n_consensus=3  # Each item extracted 3 times for verification
)
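One way to picture what consensus does per item: run the extraction several times and keep, for each field, the value most runs agree on. Retab's actual reconciliation algorithm is not documented here and may differ; this is only an illustrative majority-vote sketch:

```python
from collections import Counter

def reconcile(runs):
    """Majority-vote reconciliation across n independent extraction runs
    of the same item. Illustrative sketch only."""
    return {
        field: Counter(run[field] for run in runs).most_common(1)[0][0]
        for field in runs[0]
    }
```

A single noisy run gets outvoted, which is why consensus helps most on hard-to-read documents.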

Pricing

Chunking uses more credits than standard extraction because it processes the document in multiple chunks, and is billed at 2x the base rate per page.
total_credits = (2 × document_page_count) × n_consensus × model_credits

Cost Breakdown

| Component | Cost |
|---|---|
| Billed pages | 2 × document_page_count |
| Credits per billed page | n_consensus × model_credits |
| Total | (2 × document_page_count) × n_consensus × model_credits |

Example

For a 10-page invoice using retab-small (1.0 credit) and n_consensus=1:
  • Billed pages: 2 × 10 = 20
  • Credits per billed page: 1 × 1.0 = 1.0
  • Total: 20 × 1.0 = 20 credits
If n_consensus=3, the same document costs:
  • 20 × (3 × 1.0) = 60 credits
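The formula is simple enough to check directly. A one-line calculator reproducing the arithmetic above:

```python
def chunking_credits(page_count, n_consensus=1, model_credits=1.0):
    """Credit cost of a chunked extraction: chunking bills 2x the base
    rate per page, multiplied by consensus runs and model rate."""
    return (2 * page_count) * n_consensus * model_credits
```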
For detailed pricing information, see the Pricing documentation.