Introduction

The parse method in Retab’s document processing pipeline converts any document into clean, raw markdown text, extracted page by page. This endpoint is ideal for extracting cleaned document content to use as context for downstream processing, such as RAG pipelines, custom ingestion pipelines, embedding-based classification, and content indexing workflows. A typical RAG workflow follows these steps:
  1. Parsing: Extract clean text from documents using Retab’s parse method
  2. Chunking: Split the text into manageable blocks (sentences, paragraphs, etc.) called “chunks” for embedding
  3. Indexing: Store chunks in a vector database or any other search index for retrieval
For chunking, we recommend chonkie — a powerful and flexible text chunking library designed specifically for RAG pipelines.

Unlike other document methods that focus on chat formatting or structured extraction, parse provides:
  • Clean Text Output: Removes formatting artifacts and provides readable text
  • Page-by-Page Processing: Access content from individual pages
  • Flexible Table Formats: Choose how tables are represented (HTML, Markdown, JSON, YAML)
  • OCR Integration: Handles both text-based and image-based documents
  • Batch Processing Ready: Efficient for processing multiple documents

Parse API

ParseRequest
Returns
ParseResult Object
A ParseResult object containing the extracted text content and processing information.

Use Case: RAG (Retrieval-Augmented Generation) Pipeline Preparation

Prepare documents for RAG applications by extracting and chunking text content.
from retab import Retab
from chonkie import SentenceChunker

client = Retab()

# Parse the document
result = client.documents.parse(
    document="technical-manual.pdf",
    model="gemini-2.5-flash",
    table_parsing_format="markdown",  # Better for RAG
    image_resolution_dpi=150  # Higher quality for technical docs
)

# Initialize chunker for RAG
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",
    chunk_size=512,
    chunk_overlap=128,
    min_sentences_per_chunk=1
)

# Process each page and create chunks
all_chunks = []
for page_num, page_text in enumerate(result.pages, 1):
    chunks = list(chunker(page_text))
    
    for chunk_idx, chunk in enumerate(chunks):
        chunk_data = {
            "page": page_num,
            "chunk_id": f"page_{page_num}_chunk_{chunk_idx}",
            "text": chunk.text,  # chonkie chunks expose their text via .text
            "document": result.document.name
        }
        all_chunks.append(chunk_data)

print(f"Created {len(all_chunks)} chunks from {result.usage.page_count} pages")

Best Practices

Model Selection

  • gemini-2.5-pro: Most accurate and robust model, recommended for complex or high-stakes document parsing tasks.
  • gemini-2.5-flash: Best for speed and cost-effectiveness, suitable for most general-purpose documents.
  • gemini-2.5-flash-lite: Fastest and most cost-efficient, ideal for simple documents or high-volume batch processing where maximum throughput is needed.
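The guidance above can be folded into a small lookup helper. A sketch — the tier names ("accuracy", "balanced", "throughput") are our own illustrative labels, not Retab parameters:

```python
# Map a parsing priority to the recommended model.
# Tier names are illustrative labels, not part of the Retab API.
PARSE_MODELS = {
    "accuracy": "gemini-2.5-pro",           # complex or high-stakes documents
    "balanced": "gemini-2.5-flash",         # most general-purpose documents
    "throughput": "gemini-2.5-flash-lite",  # simple docs, high-volume batches
}

def pick_parse_model(priority: str = "balanced") -> str:
    """Return the recommended model name for a given priority tier."""
    try:
        return PARSE_MODELS[priority]
    except KeyError:
        raise ValueError(
            f"unknown priority {priority!r}; expected one of {sorted(PARSE_MODELS)}"
        )
```

The returned string can be passed directly as the `model` argument to `client.documents.parse(...)`.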

Image Quality Settings

  • Standard documents: 72-96 DPI
  • Technical documents: 150 DPI
  • Fine print/small text: 300+ DPI
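These thresholds translate into a simple decision helper for the `image_resolution_dpi` parameter. A sketch under the recommendations above; the flag names are illustrative:

```python
def choose_dpi(technical: bool = False, fine_print: bool = False) -> int:
    """Pick an image_resolution_dpi value per the guidance above."""
    if fine_print:
        return 300  # small text needs 300+ DPI
    if technical:
        return 150  # diagrams and dense technical layouts
    return 96       # standard documents (72-96 DPI is enough)
```

Higher DPI improves OCR fidelity at the cost of processing time, so start at the lowest setting that reads cleanly.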

Make files LLM-ready

Retab’s document processing pipeline automatically converts various file types into LLM-ready formats, eliminating the need for custom parsers. This guide explains how to process different document types and understand the resulting output format.

Supported File Types

Retab supports a wide range of document formats:
  • Text Documents: PDF, DOC, DOCX, TXT
  • Spreadsheets: XLS, XLSX, CSV
  • Emails: EML, MSG
  • Images: JPG, PNG, TIFF
  • Presentations: PPT, PPTX
  • And more: HTML, XML, JSON

Create Messages

Converts any document into OpenAI-compatible chat messages. You can choose between different preprocessing parameters according to your needs: modalities (text, image, native) and image settings (dpi, browser_canvas, etc.).
Returns
DocumentMessage Object
A DocumentMessage object with the messages created from the document.
from retab import Retab
from openai import OpenAI

client = Retab()

doc_msg = client.documents.create_messages(
    document="freight/booking_confirmation.jpg",
    modality="text",
    image_resolution_dpi=72,
    browser_canvas="A4"
)

oai_client = OpenAI()

response = oai_client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=doc_msg.openai_messages + [
        {
            "role": "user",
            "content": "Summarize the document in 100 words."
        }
    ]
)
print(response.choices[0].message.content)
Use doc_msg.items to get a list of PIL.Image.Image | str objects
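Since `doc_msg.items` mixes text strings and PIL images, a small helper can separate the two. This sketch assumes only the typing stated above: anything that is not a `str` is treated as an image.

```python
def split_items(items):
    """Separate a DocumentMessage's items into text and image lists.

    Per the docs, items is a list of PIL.Image.Image | str objects;
    anything that is not a str is treated as an image.
    """
    texts = [item for item in items if isinstance(item, str)]
    images = [item for item in items if not isinstance(item, str)]
    return texts, images
```

Usage: `texts, images = split_items(doc_msg.items)` — useful when you want to index the text while handling images separately.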

Create Inputs

Converts any document and a JSON schema into OpenAI-compatible responses input. You can choose between different preprocessing parameters according to your needs: modalities (text, image, native) and image settings (dpi, browser_canvas, etc.).
Returns
DocumentMessage Object
A DocumentMessage object with the document content structured according to the provided JSON schema.
from retab import Retab

client = Retab()

doc_input = client.documents.create_inputs(
    document="freight/invoice.pdf",
    json_schema={
        "properties": {
            "invoice_number": {
                "type": "string",
                "description": "The invoice number"
            },
            "total_amount": {
                "type": "number",
                "description": "The total invoice amount"
            },
            "issue_date": {
                "type": "string",
                "description": "The date the invoice was issued"
            }
        },
        "required": ["invoice_number", "total_amount", "issue_date"]
    },
    modality="text",
    image_resolution_dpi=72,
    browser_canvas="A4"
)