The Schema is the blueprint for structured data extraction—a formal specification that defines the exact shape, types, and validation rules for the data you want to extract from documents. Think of it as a contract between you and the AI: you define what fields you need (invoice amounts, customer names, dates), their data types (string, number, boolean), and any constraints (required fields, format patterns), and the AI delivers perfectly structured data that matches your specification every time. By using JSON Schema, you transform unpredictable document parsing into reliable, type-safe data pipelines that integrate seamlessly with your applications, databases, and workflows.
The Generate Schema endpoint allows you to automatically generate a JSON Schema from a set of example documents. This is particularly useful when you want to create a schema that captures all the important fields and patterns present in your documents.
You can provide multiple documents to ensure the generated schema covers all possible variations in your data structure. The AI will analyze the documents and create a comprehensive schema with appropriate field descriptions and validation rules.
```python
from retab import Retab

reclient = Retab()

schema_obj = reclient.schemas.generate(
    modality="native",
    model="gpt-4.1",
    temperature=0,
    stream=False,
    documents=[
        "freight/booking_confirmation_1.jpg",
        "freight/booking_confirmation_2.jpg"
    ]
)
```
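If you want to keep the generated schema under version control or reuse it in later extractions, you can persist it to disk. A minimal sketch, assuming the returned object exposes the generated schema as a plain dict on a `json_schema` attribute (an assumption for illustration; check the object your SDK version returns):

```python
import json

# Assumption: `schema_obj.json_schema` holds the generated JSON Schema as a plain dict.
with open("booking_confirmation.schema.json", "w") as f:
    json.dump(schema_obj.json_schema, f, indent=2)
```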
The Schema class turns a design-time schema that you supply (JSON Schema or Pydantic model) into all the artefacts required to obtain, validate and post-process a large-language-model (LLM) extraction:
{ "X-SystemPrompt": "You are a useful assistant extracting information from documents.", "properties": { "name": { "description": "The name of the calendar event.", "title": "Name", "type": "string" }, "date": { "X-ReasoningPrompt": "The user can mention it in any format, like **next week** or **tomorrow**. Infer the right date format from the user input.", "description": "The date of the calendar event in ISO 8601 format.", "title": "Date", "type": "string" } }, "required": [ "name", "date" ], "title": "CalendarEvent", "type": "object"}
Ingests a schema specifying the data structure you want to extract (JSON Schema or Pydantic Model).
Produces A system prompt and unfolds the reasoning fields into a new data structure that will be used when calling the LLM.
This design allows to separate the inference logic from the clean business object you ultimately use. It minimises boilerplate while keeping every transformation explicit and auditable.
You can build a Schema directly from a JSON Schema:

```python
from retab import Schema

schema_obj = Schema(
    json_schema={
        "X-SystemPrompt": "You are a useful assistant.",
        "properties": {
            "name": {
                "description": "The name of the calendar event.",
                "title": "Name",
                "type": "string"
            },
            "date": {
                "description": "The date of the calendar event in ISO 8601 format.",
                "title": "Date",
                "type": "string"
            }
        },
        "required": ["name", "date"],
        "title": "CalendarEvent",
        "type": "object"
    }
)
```
Or from a Pydantic model:

```python
from retab import Schema
from pydantic import BaseModel, ConfigDict, Field

class CalendarEvent(BaseModel):
    model_config = ConfigDict(
        json_schema_extra={"X-SystemPrompt": "You are a useful assistant."}
    )

    name: str = Field(..., description="The name of the calendar event.")
    date: str = Field(..., description="The date of the calendar event in ISO 8601 format.")

schema_obj = Schema(
    pydantic_model=CalendarEvent
)
```
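Because `CalendarEvent` remains an ordinary Pydantic model, you can validate whatever the LLM returns against it and work with a typed object afterwards. A minimal sketch using only Pydantic; the extraction payload below is hypothetical:

```python
# Hypothetical payload extracted by the LLM for a CalendarEvent.
raw_extraction = {"name": "Quarterly planning meeting", "date": "2025-03-14"}

# Validate and coerce the raw dict into the clean business object.
event = CalendarEvent.model_validate(raw_extraction)
print(event.name, event.date)
```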