The Schema is the blueprint for structured data extraction—a formal specification that defines the exact shape, types, and validation rules for the data you want to extract from documents. Think of it as a contract between you and the AI: you define what fields you need (invoice amounts, customer names, dates), their data types (string, number, boolean), and any constraints (required fields, format patterns), and the AI delivers perfectly structured data that matches your specification every time. By using JSON Schema, you transform unpredictable document parsing into reliable, type-safe data pipelines that integrate seamlessly with your applications, databases, and workflows.
The Generate Schema endpoint allows you to automatically generate a JSON Schema from a set of example documents. This is particularly useful when you want to create a schema that captures all the important fields and patterns present in your documents.
You can provide multiple documents to ensure the generated schema covers all possible variations in your data structure. The AI will analyze the documents and create a comprehensive schema with appropriate field descriptions and validation rules.
```python
from retab import Retab

reclient = Retab()

schema_obj = reclient.schemas.generate(
    modality="native",
    model="gpt-4.1",
    temperature=0,
    stream=False,
    documents=[
        "freight/booking_confirmation_1.jpg",
        "freight/booking_confirmation_2.jpg"
    ]
)
```
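If you want to keep the generated schema under version control or reuse it in later extractions, you can persist it to disk. A minimal sketch, assuming the returned object exposes the generated schema as a plain dict on a `json_schema` attribute (an assumption for illustration; check the object your SDK version returns):

```python
import json

# Assumption: `schema_obj.json_schema` holds the generated JSON Schema as a plain dict.
with open("booking_confirmation.schema.json", "w") as f:
    json.dump(schema_obj.json_schema, f, indent=2)
```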
The Schema class turns a design-time schema that you supply (JSON Schema or Pydantic model) into all the artefacts required to obtain, validate and post-process a large-language-model (LLM) extraction:
{ "X-SystemPrompt": "You are a useful assistant extracting information from documents.", "properties": { "name": { "description": "The name of the calendar event.", "title": "Name", "type": "string" }, "date": { "X-ReasoningPrompt": "The user can mention it in any format, like **next week** or **tomorrow**. Infer the right date format from the user input.", "description": "The date of the calendar event in ISO 8601 format.", "title": "Date", "type": "string" } }, "required": [ "name", "date" ], "title": "CalendarEvent", "type": "object"}
Ingests a schema specifying the data structure you want to extract (JSON Schema or Pydantic Model).
Produces A system prompt and unfolds the reasoning fields into a new data structure that will be used when calling the LLM.
This design allows to separate the inference logic from the clean business object you ultimately use. It minimises boilerplate while keeping every transformation explicit and auditable.
You can build a Schema directly from a JSON Schema:

```python
from retab import Schema

schema_obj = Schema(
    json_schema={
        "X-SystemPrompt": "You are a useful assistant.",
        "properties": {
            "name": {
                "description": "The name of the calendar event.",
                "title": "Name",
                "type": "string"
            },
            "date": {
                "description": "The date of the calendar event in ISO 8601 format.",
                "title": "Date",
                "type": "string"
            }
        },
        "required": ["name", "date"],
        "title": "CalendarEvent",
        "type": "object"
    }
)
```
Or from a Pydantic model:

```python
from retab import Schema
from pydantic import BaseModel, ConfigDict, Field

class CalendarEvent(BaseModel):
    model_config = ConfigDict(
        json_schema_extra={"X-SystemPrompt": "You are a useful assistant."}
    )

    name: str = Field(..., description="The name of the calendar event.")
    date: str = Field(..., description="The date of the calendar event in ISO 8601 format.")

schema_obj = Schema(
    pydantic_model=CalendarEvent
)
```
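Because `CalendarEvent` remains an ordinary Pydantic model, you can validate whatever the LLM returns against it and work with a typed object afterwards. A minimal sketch using only Pydantic; the extraction payload below is hypothetical:

```python
# Hypothetical payload extracted by the LLM for a CalendarEvent.
raw_extraction = {"name": "Quarterly planning meeting", "date": "2025-03-14"}

# Validate and coerce the raw dict into the clean business object.
event = CalendarEvent.model_validate(raw_extraction)
print(event.name, event.date)
```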