What is Retab?

Retab solves all the major challenges in data processing with Large Language Models:
  1. Parsing: Convert any file type (PDFs, Excel, emails, etc.) into an LLM-ready format without writing custom parsers
  2. Extraction: Get consistent, reliable outputs using schema-based prompt engineering
  3. Projects: Evaluate the performance of models against annotated datasets
  4. Deployments: Publish a live, stable, shareable document processor from your project
Our goal is to make analyzing documents and unstructured data as easy and transparent as possible. We provide all the software-defined primitives you need to build your own document processing solutions: think of Retab as Stripe for document processing.
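
To make this concrete, here is a minimal sketch of what an extraction call can look like. The client class, method, and parameter names below are illustrative assumptions rather than the exact SDK surface; see the API reference for the real signatures.

from retab import Retab  # assumed client entry point, for illustration only

client = Retab()

# Hypothetical call: parse a document and extract data that matches a JSON Schema.
result = client.documents.extract(
    document="invoice.pdf",             # any supported file type
    json_schema="invoice_schema.json",  # the structure you want back
    model="gpt-4.1",                    # the LLM used for extraction
)
print(result)  # structured data conforming to the schema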

A new, lighter paradigm

Large Language Models collapse entire layers of legacy OCR pipelines into a single, elegant abstraction. When a model can read, reason, and structure text natively, we no longer need brittle heuristics, handcrafted parsers, or heavyweight ETL jobs. Instead, we can expose a small, principled API: input your document, define the output schema, and receive reliable structured data. This reduces complexity, improves accuracy, speeds up processing, and lowers costs. By building around LLMs from the ground up, we shift the focus from tedious infrastructure to extracting meaningful answers from your data.

Many people haven’t yet realized how powerful LLMs have become at document processing tasks. We believe that LLMs and structured generation are among the most impactful breakthroughs of the 21st century. AI is the new electricity, and Retab is here to help you tame it.

Structured Generation

JSON is one of the most widely used formats in the world for applications to exchange data. Structured Generation is a feature that ensures the AI model will always generate responses that adhere to your supplied JSON Schema, so you don’t need to worry about the model omitting a required key, or hallucinating an invalid enum value.
Every major LLM provider now offers native structured generation support. Here is what it looks like with the OpenAI Python SDK:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

# Define the desired output structure as a Pydantic model.
class ResearchPaperExtraction(BaseModel):
    title: str
    authors: list[str]
    abstract: str
    keywords: list[str]

# The SDK converts the model into a JSON Schema and constrains the response to it.
completion = client.beta.chat.completions.parse(
    model="gpt-4.1",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are an expert at structured data extraction. You will be given unstructured text from a research paper and should convert it into the given structure."},
        {"role": "user", "content": "..."}
    ],
    response_format=ResearchPaperExtraction,
)

paper = completion.choices[0].message.parsed  # a ResearchPaperExtraction instance
Usage involves defining a schema for your desired output and including it in your API request. The schema can be a JSON Schema document or a data model class (like Pydantic BaseModel) that SDKs convert to JSON Schema. The LLM generates responses conforming to that schema, eliminating the need for post-processing or complex prompt engineering.
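
For reference, the snippet below prints the JSON Schema that the Pydantic model above converts to. The commented output is approximate and may differ slightly across Pydantic versions.

import json

# Inspect the JSON Schema generated from ResearchPaperExtraction.
print(json.dumps(ResearchPaperExtraction.model_json_schema(), indent=2))
# {
#   "properties": {
#     "title": {"title": "Title", "type": "string"},
#     "authors": {"items": {"type": "string"}, "title": "Authors", "type": "array"},
#     "abstract": {"title": "Abstract", "type": "string"},
#     "keywords": {"items": {"type": "string"}, "title": "Keywords", "type": "array"}
#   },
#   "required": ["title", "authors", "abstract", "keywords"],
#   "title": "ResearchPaperExtraction",
#   "type": "object"
# }

Whichever form you supply, the raw JSON Schema or the model class, the result is the same contract enforced on the model's output.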

Community

Let’s create the future of document processing together! Join our Discord community to share tips, discuss best practices, and showcase what you build. Or just tweet at us. We can’t wait to see how you’ll use Retab.

Roadmap

We share our roadmap publicly. Please submit your feature requests on GitHub. Among the features we’re working on:
  • Schema optimization autopilot
  • Sources API
  • Document Edit API

Learn More