Sources

When extracting data from documents, it’s often important to track where each piece of information came from. Source quotes help by capturing the exact text from the document that supports each extracted value. Source quotes use a special JSON Schema annotation to create auxiliary fields that store verbatim quotes from the source document:

X-SourceQuote - Generates source quote fields alongside data fields, capturing the exact text from the document that justifies each extracted value.

This approach improves traceability and verification while keeping your original schema structure intact.

Source Quote

An X-SourceQuote: true tag in the schema generates a source quote field alongside the data field. This is particularly useful for auditing, compliance, or any case where you need to verify where a value came from.

Example: Contract Information Extraction

Contract.md

This Service Agreement ("Agreement") is entered into as of March 15, 2024.

PARTIES:
- Provider: TechCorp Solutions Inc., a Delaware corporation
- Client: Global Industries LLC, with principal offices at 123 Main Street, Boston, MA 02101

TERMS:
The initial term of this Agreement shall be twenty-four (24) months, commencing on April 1, 2024.
The monthly service fee shall be $15,000 USD, payable within 30 days of invoice.

Let’s say we want to extract contract details and track exactly where each value was found in the document:

from pydantic import BaseModel, Field
from datetime import date

# You can define the custom annotations in the `pydantic.Field` class using the `json_schema_extra` field.

class ContractInfo(BaseModel):
    provider_name: str = Field(...,
        description="Name of the service provider",
        json_schema_extra={
            "X-SourceQuote": True,
        }
    )
    contract_value: float = Field(...,
        description="Monthly contract value in USD",
        json_schema_extra={
            "X-SourceQuote": True,
        }
    )
    start_date: date
    duration_months: int

# If you need a json_schema, you can call ContractInfo.model_json_schema()
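
To see which fields carry the annotation, you can walk the generated schema. The sketch below uses a hand-written, abridged dictionary shaped like what ContractInfo.model_json_schema() is expected to return, on the assumption that Pydantic merges json_schema_extra into each property's schema. The helper name source_quoted_fields is hypothetical and not part of any library.

```python
# Abridged, hand-written stand-in for ContractInfo.model_json_schema().
# Only the keys needed for this illustration are included.
schema = {
    "type": "object",
    "properties": {
        "provider_name": {"type": "string", "X-SourceQuote": True},
        "contract_value": {"type": "number", "X-SourceQuote": True},
        "start_date": {"type": "string", "format": "date"},
        "duration_months": {"type": "integer"},
    },
}


def source_quoted_fields(schema: dict) -> list[str]:
    """Return top-level property names tagged with X-SourceQuote: true."""
    return [
        name
        for name, prop in schema.get("properties", {}).items()
        if prop.get("X-SourceQuote") is True
    ]


print(source_quoted_fields(schema))
```

Only provider_name and contract_value are tagged here, which matches the fields that receive a source___ companion in the output shown next.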

With source quotes enabled, the extraction includes the exact text from the document that supports each value:

{
  "source___provider_name": "Provider: TechCorp Solutions Inc., a Delaware corporation",
  "provider_name": "TechCorp Solutions Inc.",
  "source___contract_value": "The monthly service fee shall be $15,000 USD",
  "contract_value": 15000,
  "start_date": "2024-04-01",
  "duration_months": 24
}

As you can see, the source___ fields capture the exact verbatim text from the document, allowing you to verify each extracted value against its source.
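
Because the quotes are meant to be verbatim, a simple substring check against the original document can flag quotes that were paraphrased or invented. The following is a minimal sketch, assuming the source___ prefix shown in the output above; the function name check_quotes is hypothetical.

```python
SOURCE_PREFIX = "source___"  # prefix used by the quote fields in the example output

document = """\
This Service Agreement ("Agreement") is entered into as of March 15, 2024.

PARTIES:
- Provider: TechCorp Solutions Inc., a Delaware corporation
- Client: Global Industries LLC, with principal offices at 123 Main Street, Boston, MA 02101

TERMS:
The initial term of this Agreement shall be twenty-four (24) months, commencing on April 1, 2024.
The monthly service fee shall be $15,000 USD, payable within 30 days of invoice.
"""

extraction = {
    "source___provider_name": "Provider: TechCorp Solutions Inc., a Delaware corporation",
    "provider_name": "TechCorp Solutions Inc.",
    "source___contract_value": "The monthly service fee shall be $15,000 USD",
    "contract_value": 15000,
    "start_date": "2024-04-01",
    "duration_months": 24,
}


def check_quotes(extraction: dict, document: str) -> dict[str, bool]:
    """Map each quoted field to whether its quote occurs verbatim in the document."""
    results = {}
    for key, quote in extraction.items():
        if key.startswith(SOURCE_PREFIX):
            field = key[len(SOURCE_PREFIX):]
            results[field] = isinstance(quote, str) and quote in document
    return results


print(check_quotes(extraction, document))
```

A field that maps to False is a candidate for manual review. Exact matching is deliberately strict; if your documents go through OCR or whitespace normalization, you may need to relax the comparison.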

Key Benefits

Traceability: Know exactly where each piece of extracted data came from
Verification: Users can verify extracted values against the original text
Compliance: Meet audit requirements by documenting data provenance
Debugging: Quickly identify extraction errors by comparing source quotes with extracted values
Trust: Build confidence in automated extractions with transparent sourcing
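
For the debugging use case, one possible heuristic is to check that each extracted value actually appears in its own source quote. The sketch below is an assumption about how such a check could look, not part of the extraction API; value_in_quote is a hypothetical helper, and the numeric comparison simply ignores thousands separators.

```python
import re


def value_in_quote(value, quote: str) -> bool:
    """Heuristic: does the extracted value appear in its source quote?"""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Compare numerically against every number found in the quote,
        # so that 15000 matches "$15,000".
        numbers = re.findall(r"\d[\d,]*(?:\.\d+)?", quote)
        return any(float(n.replace(",", "")) == float(value) for n in numbers)
    # For non-numeric values, fall back to a case-insensitive substring test.
    return str(value).casefold() in quote.casefold()


print(value_in_quote(15000, "The monthly service fee shall be $15,000 USD"))
print(value_in_quote(
    "TechCorp Solutions Inc.",
    "Provider: TechCorp Solutions Inc., a Delaware corporation",
))
```

A mismatch does not always mean the extraction is wrong (dates and written-out numbers such as "twenty-four" need their own normalization), but it is a cheap way to surface values worth a second look.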

Best Practices

Use source quotes for fields that require verification or audit trails
Apply to high-value or legally significant data points
Combine with Reasoning for complex extractions that involve both calculation and source verification
Prefer leaf fields (strings, numbers, dates) over nested objects, since source quotes work best when tied to a single value
