Projects provide a systematic way to test and validate your extraction schemas against known ground truth data. Think of it as evals for document AI: you can measure accuracy, compare different models, and optimize your extraction pipelines with confidence.

A project consists of documents with annotations (your test data), iterations (test runs with different settings), and a schema (what you want to extract). This structure lets you run A/B tests between models and systematically improve your document processing accuracy.

How it works

  1. Create a project with your extraction schema
  2. Upload test documents with manually verified ground truth annotations
  3. Run iterations with different model settings (GPT-4o vs GPT-4o-mini, consensus, etc.)
  4. Compare results to find the optimal configuration for your use case
Retab automatically calculates accuracy metrics by comparing each iteration’s output against your ground truth annotations, giving you objective performance measurements.
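Conceptually, the metric is simple: for each annotated field, check whether the extracted value matches the ground truth and average over your test documents. Here is a minimal sketch of that idea in plain Python (illustrative only — Retab computes and aggregates these metrics for you, with more sophistication than exact matching):

# Illustrative only: naive per-field exact-match accuracy over test documents.
def field_accuracy(predictions: list[dict], ground_truths: list[dict]) -> dict[str, float]:
    scores = {}
    for field in ground_truths[0].keys():
        correct = sum(
            1 for pred, truth in zip(predictions, ground_truths)
            if pred.get(field) == truth.get(field)
        )
        scores[field] = correct / len(ground_truths)
    return scores

# "total_amount" extracted correctly in 1 of 2 documents -> 0.5
print(field_accuracy(
    [{"total_amount": 120.5}, {"total_amount": 99.0}],
    [{"total_amount": 120.5}, {"total_amount": 98.0}],
))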

Schema Optimization Through Projects

One of the most powerful features of projects is schema refinement. When you see poor accuracy on specific fields, you can:
  • Improve descriptions: Make field descriptions more specific and unambiguous
  • Add reasoning prompts: Use X-ReasoningPrompt for complex calculations or logic
  • Refine field types: Adjust data types based on extraction patterns
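For example, a schema whose descriptions say almost nothing gives the model little to work with: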
{
  "type": "object",
  "properties": {
    "amount": {
      "type": "number",
      "description": "The amount"
    },
    "date": {
      "type": "string", 
      "description": "The date"
    }
  }
}
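A refined version pins down units, formats, and edge cases, and attaches a reasoning prompt where logic is involved. (The exact placement of the X-ReasoningPrompt key below is a sketch — confirm the syntax against the schema reference.)

{
  "type": "object",
  "properties": {
    "amount": {
      "type": "number",
      "description": "Total invoice amount in USD, including tax, as a decimal number (e.g. 1234.56)",
      "X-ReasoningPrompt": "If the document lists subtotal and tax separately, add them to get the total."
    },
    "date": {
      "type": "string",
      "description": "Invoice issue date in ISO 8601 format (YYYY-MM-DD)"
    }
  }
}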
The project workflow for schema optimization:
  1. Run initial project → identify low-accuracy fields
  2. Refine descriptions and add reasoning prompts → re-run project
  3. Compare accuracy improvements → iterate until satisfied
  4. Deploy optimized schema to production

Quick Start

While you can create projects programmatically with the SDK, we recommend using the Retab platform for project management. The web interface provides powerful schema editing tools, visual result comparisons, and collaborative features that make optimization much easier.
Let’s create a project for invoice processing:
from retab import Retab

client = Retab()

# Define what you want to extract
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "vendor_name": {"type": "string"},
        "invoice_date": {"type": "string"}
    },
    "required": ["invoice_number", "total_amount", "vendor_name"]
}

project = client.projects.create(
    name="Invoice Processing Test",
    json_schema=invoice_schema
)
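With the project created, upload your annotated test documents and run iterations — from the platform or the SDK — and Retab will report per-field accuracy for each iteration so you can compare configurations.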

Key Benefits

  1. Objective Measurement: Get precise accuracy scores instead of subjective assessments
  2. Model Comparison: Test different models to find the best fit
  3. Schema Validation: Identify which fields are hardest to extract accurately
  4. Cost Optimization: Balance accuracy against processing costs for your use case

Best Practices

  • Diverse Test Data: Include various document formats, qualities, and edge cases
  • Sufficient Volume: Use at least 5-10 test documents for reliable metrics
  • Ground Truth Quality: Double-check your annotations—bad ground truth leads to misleading results

Deployments

Deployments let you call a tested project configuration directly via the API route https://api.retab.com/v1/projects/extract/{project_id}/{iteration_id}. This is the primary method for running document extraction in production with a project-based configuration.
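If you call the route directly rather than through the SDK, the request is a standard authenticated POST. A minimal sketch with requests — note that the authentication header name and the request body shape shown here are assumptions; confirm both against the API Reference:

import base64
import requests

url = "https://api.retab.com/v1/projects/extract/proj_01G34H8J2K/iter_01G34H8J2L"

with open("invoice.pdf", "rb") as f:
    content = base64.b64encode(f.read()).decode()

# Assumptions (check the API Reference): the "Api-Key" header name and
# sending the document as a base64 payload in a JSON body.
response = requests.post(
    url,
    headers={"Api-Key": "YOUR_API_KEY"},
    json={"document": {"filename": "invoice.pdf", "content": content}},
)
response.raise_for_status()
print(response.json())  # RetabParsedChatCompletion as JSON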
Returns
RetabParsedChatCompletion — the extracted data as a JSON object matching the project’s schema.
from retab import Retab, MIMEData

client = Retab()

# Process a single document
with open("invoice.pdf", "rb") as f:
    mime = MIMEData.from_bytes(f.read(), filename="invoice.pdf")

completion = client.projects.extract(
    project_id="proj_01G34H8J2K",
    iteration_id="iter_01G34H8J2L",  # or "base-configuration" for default settings
    document=mime,
    temperature=0.1,  # Optional override
    seed=42,  # Optional for reproducibility
    store=True  # Whether to store results
)
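The same call accepts a batch via the documents parameter (mutually exclusive with document, per the parameter list below):

# Process several documents in one call using the `documents` parameter.
# Entries can be paths, bytes, URLs, etc.; see the batch return shape in
# the API Reference.
completions = client.projects.extract(
    project_id="proj_01G34H8J2K",
    iteration_id="base-configuration",
    documents=["invoice_1.pdf", "invoice_2.pdf"],
)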

Parameters

  • project_id (string, required): ID of the project.
  • iteration_id (string, required): ID of the specific iteration to use, or "base-configuration" to use the project’s default settings.
  • document (Path | str | bytes | IOBase | MIMEData | PIL.Image.Image | HttpUrl): Single document to process (mutually exclusive with documents).
  • documents (List[Path | str | bytes | IOBase | MIMEData | PIL.Image.Image | HttpUrl]): List of documents to process (mutually exclusive with document).
  • temperature (float): Optional temperature override for this specific request. Overrides the default temperature.
  • seed (int): Optional seed for reproducible results across multiple runs.
  • store (bool, default True): Whether to store the extraction results for later retrieval and analysis.
Please check the API Reference for complete method documentation.