Introduction
The parse method in Retab’s document processing pipeline converts any document into cleaned, raw markdown text with page-by-page extraction. This endpoint is ideal for extracting cleaned document content to be used as context for downstream processing, such as RAG pipelines, custom ingestion pipelines, embeddings classification, and content indexing workflows.
The typical RAG workflow follows these steps:
- Parsing: Extract clean text from documents using Retab’s parse method
- Chunking: Split the text into manageable blocks (sentences, paragraphs, etc.) called “chunks” for embedding
- Indexing: Store chunks in a vector database or any other search index for retrieval
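The chunking and indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not Retab's implementation: it assumes the parsed output is already available as a list of per-page strings, and the names Chunk, chunk_pages, and build_index are hypothetical helpers introduced here. The in-memory keyword index only stands in for a vector database or search index.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int   # 1-based page number the chunk came from
    index: int  # position of the chunk within the whole document
    text: str


def chunk_pages(pages: list[str], max_chars: int = 500) -> list[Chunk]:
    """Group paragraphs of each page into chunks of at most max_chars.

    `pages` is assumed to hold the parsed text of each page, in order.
    """
    chunks: list[Chunk] = []
    for page_number, page_text in enumerate(pages, start=1):
        buffer = ""
        for paragraph in page_text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            # Hard-split paragraphs that are longer than a single chunk.
            pieces = [
                paragraph[i:i + max_chars]
                for i in range(0, len(paragraph), max_chars)
            ]
            for piece in pieces:
                candidate = f"{buffer}\n\n{piece}" if buffer else piece
                if len(candidate) <= max_chars:
                    buffer = candidate
                else:
                    # The buffer is full: emit it and start a new one.
                    chunks.append(Chunk(page_number, len(chunks), buffer))
                    buffer = piece
        if buffer:
            # Chunks never span pages, so flush at the end of each page.
            chunks.append(Chunk(page_number, len(chunks), buffer))
    return chunks


def build_index(chunks: list[Chunk]) -> dict[str, set[int]]:
    """Toy keyword index: maps each lowercase token to chunk indices."""
    index: dict[str, set[int]] = defaultdict(set)
    for chunk in chunks:
        for token in re.findall(r"\w+", chunk.text.lower()):
            index[token].add(chunk.index)
    return index
```

Keeping the page number on each chunk lets retrieved passages be traced back to their source page. In a real pipeline the keyword index would typically be replaced by embeddings stored in a vector database.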
parse provides:
- Clean Text Output: Removes formatting artifacts and provides readable text
- Page-by-Page Processing: Access content from individual pages
- Flexible Table Formats: Choose how tables are represented (HTML, Markdown, JSON, YAML)
- OCR Integration: Handles both text-based and image-based documents
- Batch Processing Ready: Efficient for processing multiple documents
Parse API
The parse method returns a ParseResult object containing the extracted text content and processing information.
Use Case: RAG (Retrieval-Augmented Generation) Pipeline Preparation
Prepare documents for RAG applications by extracting and chunking text content.
Best Practices
Model Selection
- gemini-2.5-pro: Most accurate and robust model, recommended for complex or high-stakes document parsing tasks.
- gemini-2.5-flash: Best for speed and cost-effectiveness, suitable for most general-purpose documents.
- gemini-2.5-flash-lite: Fastest and most cost-efficient, ideal for simple documents or high-volume batch processing where maximum throughput is needed.
Image Quality Settings
- Standard documents: 96 DPI
- Technical documents: 150 DPI
- Fine print/small text: 300+ DPI
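The model and DPI recommendations above can be captured in a small lookup helper. This is only a sketch: the category keys ("accuracy", "balanced", "throughput", "standard", "technical", "fine_print"), the function name recommended_settings, and the keys of the returned dictionary are hypothetical labels chosen for illustration, not confirmed Retab parameter names. The value 300 is used as the lower bound of the "300+ DPI" guidance.

```python
# Model names and DPI values are taken from the recommendations above;
# the dictionary keys are hypothetical labels, not Retab parameters.
MODEL_BY_PRIORITY = {
    "accuracy": "gemini-2.5-pro",
    "balanced": "gemini-2.5-flash",
    "throughput": "gemini-2.5-flash-lite",
}

DPI_BY_DOCUMENT_KIND = {
    "standard": 96,
    "technical": 150,
    "fine_print": 300,  # lower bound of the "300+ DPI" recommendation
}


def recommended_settings(priority: str = "balanced",
                         document_kind: str = "standard") -> dict:
    """Return the suggested model and image DPI for a document.

    Raises KeyError for an unknown priority or document kind.
    """
    return {
        "model": MODEL_BY_PRIORITY[priority],
        "dpi": DPI_BY_DOCUMENT_KIND[document_kind],
    }
```

For example, recommended_settings("accuracy", "fine_print") pairs the most accurate model with the highest resolution, which suits dense contracts or scanned forms with small text.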