When extracting data from documents, it’s often important to track where each piece of information came from. Source quotes help by capturing the exact text from the document that supports each extracted value. Source quotes use a special JSON Schema annotation to create auxiliary fields that store verbatim quotes from the source document:
  • X-SourceQuote - Generates source quote fields alongside data fields, capturing the exact text from the document that justifies each extracted value.
This approach improves traceability and verification while keeping your original schema structure intact.

Source Quote

An X-SourceQuote: true annotation on a schema field generates a source quote field alongside that data field. This is particularly useful for auditing, compliance, or any case where you need to verify where a value came from.

Example: Contract Information Extraction
Contract.md
This Service Agreement ("Agreement") is entered into as of March 15, 2024.

PARTIES:
- Provider: TechCorp Solutions Inc., a Delaware corporation
- Client: Global Industries LLC, with principal offices at 123 Main Street, Boston, MA 02101

TERMS:
The initial term of this Agreement shall be twenty-four (24) months, commencing on April 1, 2024.
The monthly service fee shall be $15,000 USD, payable within 30 days of invoice.
Let’s say we want to extract contract details and track exactly where each value was found in the document:
from pydantic import BaseModel, Field
from datetime import date

# Custom annotations are passed to `pydantic.Field` through its `json_schema_extra` argument.

class ContractInfo(BaseModel):
    provider_name: str = Field(...,
        description="Name of the service provider",
        json_schema_extra={
            "X-SourceQuote": True,
        }
    )
    contract_value: float = Field(...,
        description="Monthly contract value in USD",
        json_schema_extra={
            "X-SourceQuote": True,
        }
    )
    start_date: date
    duration_months: int

# To get the JSON Schema for this model, call ContractInfo.model_json_schema()
With source quotes enabled, the extraction includes the exact text from the document that supports each value:
{
  "source___provider_name": "Provider: TechCorp Solutions Inc., a Delaware corporation",
  "provider_name": "TechCorp Solutions Inc.",
  "source___contract_value": "The monthly service fee shall be $15,000 USD",
  "contract_value": 15000,
  "start_date": "2024-04-01",
  "duration_months": 24
}
The source___ fields hold verbatim text from the document, so each extracted value can be checked against the passage that supports it. Fields without the annotation (start_date and duration_months here) get no source quote.
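Because the quotes are meant to be verbatim, one simple check is to confirm that each one actually appears in the source text. The sketch below is an illustration only, not part of any library: the helper names `normalize` and `find_unsupported_quotes` are hypothetical, and it assumes that collapsing whitespace before comparing is acceptable for your documents.

```python
# Minimal sketch: flag source quotes that do not appear in the document.
# Helper names are hypothetical; only the standard library is used.

CONTRACT_TEXT = """This Service Agreement ("Agreement") is entered into as of March 15, 2024.

PARTIES:
- Provider: TechCorp Solutions Inc., a Delaware corporation
- Client: Global Industries LLC, with principal offices at 123 Main Street, Boston, MA 02101

TERMS:
The initial term of this Agreement shall be twenty-four (24) months, commencing on April 1, 2024.
The monthly service fee shall be $15,000 USD, payable within 30 days of invoice.
"""

extraction = {
    "source___provider_name": "Provider: TechCorp Solutions Inc., a Delaware corporation",
    "provider_name": "TechCorp Solutions Inc.",
    "source___contract_value": "The monthly service fee shall be $15,000 USD",
    "contract_value": 15000,
    "start_date": "2024-04-01",
    "duration_months": 24,
}

SOURCE_PREFIX = "source___"


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause false misses."""
    return " ".join(text.split())


def find_unsupported_quotes(extraction: dict, document: str) -> list[str]:
    """Return the data field names whose source quote is not found in the document."""
    doc = normalize(document)
    unsupported = []
    for key, quote in extraction.items():
        if not key.startswith(SOURCE_PREFIX):
            continue
        if not isinstance(quote, str) or normalize(quote) not in doc:
            unsupported.append(key[len(SOURCE_PREFIX):])
    return unsupported


unsupported = find_unsupported_quotes(extraction, CONTRACT_TEXT)
print(unsupported)  # → []
```

An empty list means every quote was located in the document. A non-empty list names the fields to review by hand.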

Key Benefits

  1. Traceability: Know exactly where each piece of extracted data came from
  2. Verification: Users can verify extracted values against the original text
  3. Compliance: Meet audit requirements by documenting data provenance
  4. Debugging: Quickly identify extraction errors by comparing source quotes with extracted values
  5. Trust: Build confidence in automated extractions with transparent sourcing
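For audit trails and debugging, it can help to pair each extracted value with its quote in a single record. The following is a rough sketch under stated assumptions: `audit_records` and `numbers_in` are hypothetical helpers, the `source___` prefix is taken from the example output above, and the numeric check is a simple heuristic that only recognizes digits with optional thousands separators.

```python
import re

SOURCE_PREFIX = "source___"

extraction = {
    "source___provider_name": "Provider: TechCorp Solutions Inc., a Delaware corporation",
    "provider_name": "TechCorp Solutions Inc.",
    "source___contract_value": "The monthly service fee shall be $15,000 USD",
    "contract_value": 15000,
    "start_date": "2024-04-01",
    "duration_months": 24,
}


def numbers_in(text: str) -> list[float]:
    """Extract numbers such as 15,000 or 24 from a quote (heuristic)."""
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return [float(m.replace(",", "")) for m in matches]


def audit_records(extraction: dict) -> dict:
    """Pair each data field with its source quote, if one was generated."""
    records = {}
    for field, value in extraction.items():
        if field.startswith(SOURCE_PREFIX):
            continue  # quotes are attached to their data field below
        quote = extraction.get(SOURCE_PREFIX + field)
        record = {"value": value, "quote": quote}
        # For numeric values, note whether the same number appears in the quote.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if quote is not None and is_number:
            record["number_found_in_quote"] = float(value) in numbers_in(quote)
        records[field] = record
    return records


records = audit_records(extraction)
for field, record in records.items():
    print(field, record)
```

A record whose `number_found_in_quote` flag is false is a good candidate for manual review, since the quote and the extracted value disagree.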

Best Practices

  • Use source quotes for fields that require verification or audit trails
  • Apply to high-value or legally significant data points
  • Combine with Reasoning for complex extractions that involve both calculation and source verification
  • Source quotes work best on leaf fields (strings, numbers, dates) rather than nested objects
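To apply the leaf-field guideline consistently, you could lint a schema before using it. The sketch below assumes a hand-written, fully inlined JSON Schema dict. The `parties` field and the `classify_annotated_fields` helper are hypothetical, and schemas that use `$ref`, `$defs`, or `anyOf` are not handled here.

```python
# Minimal sketch: find X-SourceQuote annotations and separate leaf fields
# from nested objects or arrays. Not a library feature; illustration only.

LEAF_TYPES = {"string", "number", "integer", "boolean"}

schema = {
    "type": "object",
    "properties": {
        "provider_name": {"type": "string", "X-SourceQuote": True},
        "contract_value": {"type": "number", "X-SourceQuote": True},
        # Hypothetical nested field, annotated to show what the lint catches.
        "parties": {
            "type": "object",
            "X-SourceQuote": True,
            "properties": {"client": {"type": "string"}},
        },
        "start_date": {"type": "string", "format": "date"},
    },
}


def classify_annotated_fields(schema: dict, path: str = "") -> tuple[list[str], list[str]]:
    """Return (annotated leaf fields, annotated non-leaf fields) as dotted paths."""
    leaf, non_leaf = [], []
    for name, prop in schema.get("properties", {}).items():
        full_name = f"{path}.{name}" if path else name
        if prop.get("X-SourceQuote") is True:
            if prop.get("type") in LEAF_TYPES:
                leaf.append(full_name)
            else:
                non_leaf.append(full_name)
        if prop.get("type") == "object":
            # Recurse into nested objects to check their properties as well.
            nested_leaf, nested_non_leaf = classify_annotated_fields(prop, full_name)
            leaf.extend(nested_leaf)
            non_leaf.extend(nested_non_leaf)
    return leaf, non_leaf


leaf_fields, non_leaf_fields = classify_annotated_fields(schema)
print("leaf:", leaf_fields)
print("review:", non_leaf_fields)
```

Fields reported under "review" are annotated objects or arrays. Moving the annotation down to their individual leaf properties follows the guideline above.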
