
Introduction

The parse method in Retab’s document processing pipeline converts any document into cleaned, raw markdown text with page-by-page extraction. This endpoint is ideal for extracting clean document content to use as context in downstream processing, such as RAG pipelines, custom ingestion pipelines, embedding-based classification, and content indexing workflows. The typical RAG workflow follows these steps:
  1. Parsing: Extract clean text from documents using Retab’s parse method
  2. Chunking: Split the text into manageable blocks (sentences, paragraphs, etc.) called “chunks” for embedding
  3. Indexing: Store chunks in a vector database or any other search index for retrieval
For chunking, we recommend chonkie, a powerful and flexible text chunking library designed specifically for RAG pipelines.

Unlike other methods that focus on chat formatting or structured extraction, parse provides:
  • Clean Text Output: Removes formatting artifacts and provides readable text
  • Page-by-Page Processing: Access content from individual pages
  • Flexible Table Formats: Choose how tables are represented (HTML, Markdown, JSON, YAML)
  • OCR Integration: Handles both text-based and image-based documents
  • Batch Processing Ready: Efficient for processing multiple documents
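The snippet below is a minimal sketch of the first workflow step: parsing a document and reading the extracted text page by page. The file name is hypothetical, and only parameters and attributes that appear elsewhere on this page are used.

from retab import Retab

client = Retab()

# Parse a document into clean markdown text, extracted page by page
result = client.documents.parse(
    document="quarterly-report.pdf",  # hypothetical input file
    model="gemini-2.5-flash",
)

# result.pages holds the extracted text for each page
for page_num, page_text in enumerate(result.pages, 1):
    print(f"--- Page {page_num} ---")
    print(page_text[:200])  # preview the first 200 characters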

Parse API

ParseRequest
Returns
ParseResult Object
A ParseResult object containing the extracted text content and processing information.
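As an illustration (not an exhaustive schema), the attributes of the result used elsewhere on this page are shown below; other fields are omitted.

from retab import Retab

client = Retab()
result = client.documents.parse(document="technical-manual.pdf", model="gemini-2.5-flash")

result.pages             # list of extracted text, one string per page
result.document.name     # name of the parsed document
result.usage.page_count  # number of pages processed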

Use Case: RAG (Retrieval-Augmented Generation) Pipeline Preparation

Prepare documents for RAG applications by extracting and chunking text content.
from retab import Retab
from chonkie import SentenceChunker

client = Retab()

# Parse the document
result = client.documents.parse(
    document="technical-manual.pdf",
    model="gemini-2.5-flash",
    table_parsing_format="markdown",  # Better for RAG
    image_resolution_dpi=150  # Higher quality for technical docs
)

# Initialize chunker for RAG
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",
    chunk_size=512,
    chunk_overlap=128,
    min_sentences_per_chunk=1
)

# Process each page and create chunks
all_chunks = []
for page_num, page_text in enumerate(result.pages, 1):
    chunks = chunker.chunk(page_text)  # returns a list of SentenceChunk objects
    
    for chunk_idx, chunk in enumerate(chunks):
        chunk_data = {
            "page": page_num,
            "chunk_id": f"page_{page_num}_chunk_{chunk_idx}",
            "text": str(chunk),
            "document": result.document.name
        }
        all_chunks.append(chunk_data)

print(f"Created {len(all_chunks)} chunks from {result.usage.page_count} pages")

Best Practices

Model Selection

  • gemini-2.5-pro: Most accurate and robust model, recommended for complex or high-stakes document parsing tasks.
  • gemini-2.5-flash: Best for speed and cost-effectiveness, suitable for most general-purpose documents.
  • gemini-2.5-flash-lite: Fastest and most cost-efficient, ideal for simple documents or high-volume batch processing where maximum throughput is needed.

Image Quality Settings

  • Standard documents: 96 DPI
  • Technical documents: 150 DPI
  • Fine print/small text: 300+ DPI
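As a quick illustration of combining these settings, the sketch below parses a hypothetical scanned document with small text, following the guidance above (most robust model, 300 DPI).

result = client.documents.parse(
    document="scanned-invoice-fine-print.pdf",  # hypothetical file with fine print
    model="gemini-2.5-pro",        # most accurate model for high-stakes parsing
    image_resolution_dpi=300,      # 300+ DPI recommended for small text
)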