sources returns the same data structure with every leaf value wrapped as {value, source}, where source points back to the exact location in the original document where the data was found.
This is useful for:
- Citation display in review UIs
- Provenance audit and debugging
- Source highlighting in document viewers
- Downstream review workflows that need to verify extracted data
Getting Sources
Retrieve sourced provenance for an extraction by its ID. The response preserves the exact shape of the extraction output, but each leaf value is replaced with a {value, source} object.
Parameters
The ID of the extraction to source.
Response Shape
The response contains both the original extraction and a sourced version with identical structure.
Sourced Leaf
Every scalar value (string, number, boolean, null) in the extraction becomes a {value, source} object:
- value: The extracted value (what the model produced). May differ from content if the model normalized or reformatted the data.
- source: Source citation pointing to the document location. null if provenance could not be determined.
Source Object
- content: The verbatim text found in the document corresponding to this value.
- anchor: Format-specific source locator. The kind field determines the anchor type. See Anchor Types below.
Anchor Types
The anchor field uses a discriminated union on kind. The anchor type depends on the document format.
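To make the "identical structure" point concrete, here is a minimal Python sketch that strips the {value, source} wrappers from a sourced payload and recovers the plain extraction. The payload is invented for illustration, and the bounding-box field names inside the anchor (page, left, top, width, height) are assumptions, not names confirmed by this page.

```python
from typing import Any


def is_sourced_leaf(node: Any) -> bool:
    """Return True if node looks like a {value, source} wrapper."""
    return isinstance(node, dict) and set(node.keys()) == {"value", "source"}


def unwrap(node: Any) -> Any:
    """Strip source wrappers, recovering the plain extraction shape."""
    if is_sourced_leaf(node):
        return node["value"]
    if isinstance(node, dict):
        return {key: unwrap(child) for key, child in node.items()}
    if isinstance(node, list):
        return [unwrap(child) for child in node]
    return node


# Hypothetical sourced payload; anchor field names are assumed.
sourced = {
    "invoice_number": {
        "value": "INV-001",
        "source": {
            "content": "Invoice No. INV-001",
            "anchor": {
                "kind": "pdf_bbox",
                "page": 1,
                "left": 0.1,
                "top": 0.05,
                "width": 0.2,
                "height": 0.02,
            },
        },
    },
    "line_items": [{"total": {"value": 12.5, "source": None}}],
    "notes": [],
}

plain = unwrap(sourced)
```

One caveat of this sketch: an extracted object whose own keys happen to be exactly value and source would be mistaken for a wrapper, so a real client may prefer to walk the schema instead of guessing from the keys.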
PDF (pdf_bbox)
Normalized bounding box on a PDF page. Coordinates are in [0, 1] with origin at top-left.
1-based page number.
Normalized left coordinate [0, 1].
Normalized top coordinate [0, 1].
Normalized width [0, 1].
Normalized height [0, 1].
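For source highlighting, a viewer usually needs pixel coordinates rather than normalized ones. The sketch below scales a normalized box to a rendered page size. The field names left, top, width, and height are assumed for illustration, as is the idea that the caller knows the rendered page dimensions.

```python
from typing import Dict, Tuple


def bbox_to_pixels(
    anchor: Dict[str, float], page_width_px: int, page_height_px: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized top-left-origin box to pixel units.

    Returns (left, top, width, height) in pixels.
    """
    left = round(anchor["left"] * page_width_px)
    top = round(anchor["top"] * page_height_px)
    width = round(anchor["width"] * page_width_px)
    height = round(anchor["height"] * page_height_px)
    return left, top, width, height


# Hypothetical anchor on a page rendered at 800 x 1000 pixels.
example_anchor = {"left": 0.25, "top": 0.5, "width": 0.5, "height": 0.125}
pixel_box = bbox_to_pixels(example_anchor, 800, 1000)
```

Because the origin is top-left, the result can be used directly with most canvas and image APIs. PDF-native coordinates, which start at the bottom-left, would need a vertical flip first.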
Image (image_bbox)
Same as PDF bbox but for image documents.
Normalized left coordinate [0, 1].
Normalized top coordinate [0, 1].
Normalized width [0, 1].
Normalized height [0, 1].
CSV (csv_cell)
Cell reference for CSV files.
1-based row number.
Column letter (A, B, AA).
Cell coordinate (e.g. A12, B5).
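Column letters follow the usual spreadsheet scheme (A through Z, then AA, AB, and so on), which is a bijective base-26 numbering. The helpers below are an illustrative sketch for converting between letters and 1-based column numbers and for building a cell coordinate. The function names are made up for this example.

```python
def column_letter_to_index(letters: str) -> int:
    """Convert a column letter such as 'A' or 'AA' to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def index_to_column_letter(index: int) -> str:
    """Convert a 1-based column index back to its letter form."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_coordinate(column: str, row: int) -> str:
    """Join a column letter and a 1-based row number, e.g. ('A', 12) -> 'A12'."""
    return f"{column}{row}"
```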
Spreadsheet (spreadsheet_cell)
Cell reference for XLSX files. Extends CSV cell with sheet information.
0-based sheet index.
Sheet name.
1-based row number.
Column letter (A, B, AA).
Cell coordinate (e.g. A12, B5).
DOCX Paragraph (docx_text_span)
Text span within a DOCX paragraph.
0-based paragraph index within the document.
Start character offset within the paragraph text.
End character offset within the paragraph text.
Raw OOXML of the matched paragraph element. Can be used to locate the exact element in the .docx archive.
DOCX Table Cell (docx_table_cell)
Cell reference within a DOCX table.
0-based table index.
0-based row index.
0-based column index.
Start character offset within the cell text.
End character offset within the cell text.
Raw OOXML of the matched table cell element. Can be used to locate the exact element in the .docx archive.
Plain Text (text_span)
Line and character offsets for plain text files (.txt, .md, .json, etc.).
1-based start line number.
1-based end line number.
Start character offset within the line.
End character offset within the line.
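A text span anchor can be resolved back to the matched text with plain string slicing. The sketch below makes several assumptions that this page does not state: the parameter names are hypothetical, character offsets are treated as 0-based, and the end offset is treated as exclusive. Adjust these if the actual responses behave differently.

```python
def extract_text_span(
    text: str, start_line: int, end_line: int, start_char: int, end_char: int
) -> str:
    """Return the text covered by a span.

    Lines are 1-based (as documented). Character offsets are ASSUMED to be
    0-based with an exclusive end; start_char applies to the first line and
    end_char to the last line.
    """
    lines = text.split("\n")
    selected = lines[start_line - 1:end_line]
    if not selected:
        return ""
    if len(selected) == 1:
        return selected[0][start_char:end_char]
    selected[0] = selected[0][start_char:]
    selected[-1] = selected[-1][:end_char]
    return "\n".join(selected)


# Hypothetical plain-text document.
document = "name: Ada\nrole: engineer\nteam: core"
single_line = extract_text_span(document, 2, 2, 6, 14)
multi_line = extract_text_span(document, 1, 2, 6, 4)
```

Comparing the extracted text with the content field of the source object is a quick way to check whether the offset assumptions hold.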
Supported Document Types
| Document Type | Anchor Kind | Description |
|---|---|---|
| pdf | pdf_bbox | Bounding box on a PDF page |
| image | image_bbox | Bounding box on an image |
| xlsx | spreadsheet_cell | Cell reference in a spreadsheet |
| csv | csv_cell | Cell reference in a CSV file |
| docx | docx_text_span / docx_table_cell | Paragraph or table cell reference |
| txt | text_span | Line/character offsets in the text file |
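A client that renders citations can use the table above as a sanity check before dispatching on kind. The following sketch encodes the mapping as a dictionary. The function name and the idea of passing the document type as a string are assumptions made for this example.

```python
# Mapping of document type to the anchor kinds listed in the table above.
EXPECTED_ANCHOR_KINDS = {
    "pdf": {"pdf_bbox"},
    "image": {"image_bbox"},
    "xlsx": {"spreadsheet_cell"},
    "csv": {"csv_cell"},
    "docx": {"docx_text_span", "docx_table_cell"},
    "txt": {"text_span"},
}


def anchor_matches_document(document_type: str, anchor: dict) -> bool:
    """Return True if the anchor's kind is one expected for this document type."""
    return anchor.get("kind") in EXPECTED_ANCHOR_KINDS.get(document_type, set())
```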
Edge Cases
- Null values: Wrapped as {"value": null, "source": null}.
- Empty arrays/objects: Preserved unchanged (not wrapped).
- Unsourced fields: If a field cannot be located in the document, source is null. The endpoint still returns 200 with partial results.
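Since a 200 response may contain partial results, a review workflow may want to list which fields came back without provenance. The sketch below walks a sourced structure and collects the paths of leaves that have a value but no source. Null values are skipped because they are always wrapped with a null source, and empty arrays and objects pass through untouched. The payload and the path format are invented for illustration.

```python
from typing import Any, List


def unsourced_paths(node: Any, path: str = "") -> List[str]:
    """Collect paths of leaves that have a non-null value but a null source."""
    if isinstance(node, dict) and set(node.keys()) == {"value", "source"}:
        missing = node["source"] is None and node["value"] is not None
        return [path] if missing else []
    found: List[str] = []
    if isinstance(node, dict):
        for key, child in node.items():
            found += unsourced_paths(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            found += unsourced_paths(child, f"{path}[{i}]")
    return found


# Hypothetical sourced payload covering each edge case.
payload = {
    "vendor": {
        "value": "Acme",
        "source": {"content": "Acme", "anchor": {"kind": "text_span"}},
    },
    "po_number": {"value": None, "source": None},  # null value
    "items": [{"sku": {"value": "X1", "source": None}}],  # unsourced field
    "tags": [],  # empty array, preserved unchanged
}

missing_fields = unsourced_paths(payload)
```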