Sources provide per-field provenance for extraction results. Given an extraction, sources returns the same data structure with every leaf value wrapped as {value, source}, where source points back to the exact location in the original document where the data was found. This is useful for:
  • Citation display in review UIs
  • Provenance audit and debugging
  • Source highlighting in document viewers
  • Downstream review workflows that need to verify extracted data

Getting Sources

Retrieve provenance for an extraction by its ID. The response preserves the exact shape of the extraction output, but each leaf value is replaced with a {value, source} object.
from retab import Retab

client = Retab()

result = client.extractions.sources("extr_01G34H8J2K")

# Access the original extraction
print(result.extraction)

# Access the sourced version (same shape, leaves wrapped with source info)
print(result.sources)

Parameters

extraction_id
string
required
The ID of the extraction to source.

Response Shape

The response contains both the original extraction and a sourced version with identical structure:
{
  "object": "extraction.sources",
  "extraction_id": "extr_01G34H8J2K",
  "file": {
    "id": "file_abc123",
    "filename": "invoice_001.pdf",
    "mime_type": "application/pdf"
  },
  "document_type": "pdf",
  "extraction": {
    "invoice_number": "INV-1032",
    "customer": {
      "name": "Acme Inc.",
      "email": "billing@acme.com"
    },
    "line_items": [
      {
        "description": "Widget A",
        "quantity": 2
      }
    ]
  },
  "sources": {
    "invoice_number": {
      "value": "INV-1032",
      "source": {
        "content": "INV-1032",
        "anchor": {
          "kind": "pdf_bbox",
          "page": 1,
          "left": 0.60,
          "top": 0.12,
          "width": 0.25,
          "height": 0.03
        }
      }
    },
    "customer": {
      "name": {
        "value": "Acme Inc.",
        "source": { "..." : "..." }
      },
      "email": {
        "value": "billing@acme.com",
        "source": { "..." : "..." }
      }
    },
    "line_items": [
      {
        "description": {
          "value": "Widget A",
          "source": { "..." : "..." }
        },
        "quantity": {
          "value": 2,
          "source": { "..." : "..." }
        }
      }
    ]
  }
}
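Because sources mirrors the extraction's nesting, reading every leaf takes a recursive walk. The following is a minimal sketch, not part of the SDK: the helper names is_sourced_leaf and iter_sourced_leaves are hypothetical, and it assumes the sources payload is available as plain dicts and lists, as in the JSON above.

```python
from typing import Any, Iterator


def is_sourced_leaf(node: Any) -> bool:
    """True for a {value, source} wrapper around a scalar value."""
    return (
        isinstance(node, dict)
        and set(node.keys()) == {"value", "source"}
        # Leaves wrap scalars only, so a dict or list under "value" means this
        # is an ordinary object whose fields happen to be named value/source.
        and not isinstance(node["value"], (dict, list))
    )


def iter_sourced_leaves(node: Any, path: str = "") -> Iterator[tuple[str, dict]]:
    """Yield (path, leaf) for every sourced leaf under node, depth-first."""
    if is_sourced_leaf(node):
        yield path, node
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from iter_sourced_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_sourced_leaves(child, f"{path}[{index}]")


# Trimmed version of the response above; sources are set to None for brevity.
example_sources = {
    "invoice_number": {"value": "INV-1032", "source": None},
    "customer": {"name": {"value": "Acme Inc.", "source": None}},
    "line_items": [{"quantity": {"value": 2, "source": None}}],
}
example_paths = [path for path, _ in iter_sourced_leaves(example_sources)]
# example_paths == ["invoice_number", "customer.name", "line_items[0].quantity"]
```

The dotted path format is an arbitrary choice for display; any stable key works as long as the review UI can map it back to a field.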

Sourced Leaf

Every scalar value (string, number, boolean, null) in the extraction becomes:
{
  "value": "INV-1032",
  "source": {
    "content": "INV-1032",
    "anchor": { "kind": "pdf_bbox", "..." : "..." }
  }
}
value
any
The extracted value (what the model produced). May differ from content if the model normalized or reformatted the data.
source
Source | null
Source citation pointing to the document location. null if provenance could not be determined.
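Since value may differ from content, a reviewer often wants to see both when they disagree. The sketch below flags such leaves; normalized_from is a hypothetical helper, the string comparison is only a heuristic (a number such as 2.0 extracted from the text "2" would be flagged), and the date values are made up for illustration.

```python
from typing import Optional


def normalized_from(leaf: dict) -> Optional[str]:
    """Return the verbatim document text if it differs from the extracted value."""
    source = leaf.get("source")
    if source is None:
        # No provenance was found, so there is nothing to compare against.
        return None
    content = source["content"]
    # Heuristic: compare the string form of the value with the verbatim text.
    return content if str(leaf["value"]) != content else None


# Illustrative leaves (hypothetical data, anchors omitted for brevity).
unchanged = {"value": "INV-1032", "source": {"content": "INV-1032", "anchor": {}}}
reformatted = {"value": "2024-01-05", "source": {"content": "Jan 5, 2024", "anchor": {}}}
```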

Source Object

content
string
The verbatim text found in the document corresponding to this value.
anchor
SourceAnchor
Format-specific source locator. The kind field determines the anchor type. See Anchor Types below.

Anchor Types

The anchor field uses a discriminated union on kind. The anchor type depends on the document format.

PDF (pdf_bbox)

Normalized bounding box on a PDF page. Coordinates are in [0, 1] with origin at top-left.
{
  "kind": "pdf_bbox",
  "page": 1,
  "left": 0.60,
  "top": 0.75,
  "width": 0.35,
  "height": 0.12
}
page
int
1-based page number.
left
float
Normalized left coordinate [0, 1].
top
float
Normalized top coordinate [0, 1].
width
float
Normalized width [0, 1].
height
float
Normalized height [0, 1].
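To draw a highlight, the normalized box has to be scaled to the size of the rendered page. A minimal sketch, assuming the page has been rendered to a bitmap whose pixel dimensions are known; bbox_to_pixels is a hypothetical helper, and it only reads the left, top, width and height fields.

```python
def bbox_to_pixels(anchor: dict, page_width: int, page_height: int) -> tuple[int, int, int, int]:
    """Convert a normalized bbox anchor to (x0, y0, x1, y1) pixels, origin top-left."""
    x0 = round(anchor["left"] * page_width)
    y0 = round(anchor["top"] * page_height)
    x1 = round((anchor["left"] + anchor["width"]) * page_width)
    y1 = round((anchor["top"] + anchor["height"]) * page_height)
    return x0, y0, x1, y1


# The anchor from the example above, on a page rendered at 1000 x 2000 pixels.
example_anchor = {"kind": "pdf_bbox", "page": 1, "left": 0.60, "top": 0.75, "width": 0.35, "height": 0.12}
example_rect = bbox_to_pixels(example_anchor, 1000, 2000)
# example_rect == (600, 1500, 950, 1740)
```

A viewer that works in native PDF user space, where the origin is at the bottom-left and y grows upward, would additionally need to flip the vertical axis.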

Image (image_bbox)

Normalized bounding box on an image. Uses the same left, top, width, and height fields as pdf_bbox, without page.
{
  "kind": "image_bbox",
  "left": 0.15,
  "top": 0.30,
  "width": 0.40,
  "height": 0.08
}
left
float
Normalized left coordinate [0, 1].
top
float
Normalized top coordinate [0, 1].
width
float
Normalized width [0, 1].
height
float
Normalized height [0, 1].

CSV (csv_cell)

Cell reference for CSV files.
{
  "kind": "csv_cell",
  "row": 12,
  "column": "A",
  "coordinate": "A12"
}
row
int
1-based row number.
column
string
Column letter (A, B, AA).
coordinate
string
Cell coordinate (e.g. A12, B5).
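A csv_cell anchor can be resolved against the original file with the standard csv module. This is a sketch under one stated assumption: row 1 is taken to be the first line of the file, header included, which the text above does not confirm. Both function names are hypothetical.

```python
import csv
import io


def column_letter_to_index(letters: str) -> int:
    """Convert a column letter (A, B, ..., Z, AA) to a 0-based index."""
    index = 0
    for char in letters.upper():
        # Column letters behave like a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def read_csv_cell(text: str, anchor: dict) -> str:
    """Return the cell that a csv_cell anchor points at."""
    rows = list(csv.reader(io.StringIO(text)))
    # Assumption: row numbering starts at the first line of the file.
    return rows[anchor["row"] - 1][column_letter_to_index(anchor["column"])]


example_csv = "name,email\nAcme Inc.,billing@acme.com\n"
example_cell = read_csv_cell(example_csv, {"kind": "csv_cell", "row": 2, "column": "B", "coordinate": "B2"})
# example_cell == "billing@acme.com"
```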

Spreadsheet (spreadsheet_cell)

Cell reference for XLSX files. Extends CSV cell with sheet information.
{
  "kind": "spreadsheet_cell",
  "sheet_index": 0,
  "sheet_name": "Customers",
  "row": 12,
  "column": "A",
  "coordinate": "A12"
}
sheet_index
int
0-based sheet index.
sheet_name
string
Sheet name.
row
int
1-based row number.
column
string
Column letter (A, B, AA).
coordinate
string
Cell coordinate (e.g. A12, B5).
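For display in a review UI, a spreadsheet_cell anchor can be rendered as a familiar sheet-qualified reference. A small sketch with a hypothetical helper name; it follows the Excel convention of wrapping the sheet name in single quotes and doubling any apostrophe inside it.

```python
def format_cell_reference(anchor: dict) -> str:
    """Render a spreadsheet_cell anchor as an Excel-style reference."""
    # Excel escapes an apostrophe inside a quoted sheet name by doubling it.
    sheet = anchor["sheet_name"].replace("'", "''")
    return f"'{sheet}'!{anchor['coordinate']}"


example_reference = format_cell_reference(
    {"kind": "spreadsheet_cell", "sheet_index": 0, "sheet_name": "Customers",
     "row": 12, "column": "A", "coordinate": "A12"}
)
# example_reference == "'Customers'!A12"
```

To read the cell itself, a library such as openpyxl accepts the same two fields, along the lines of workbook[sheet_name][coordinate].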

DOCX Paragraph (docx_text_span)

Text span within a DOCX paragraph.
{
  "kind": "docx_text_span",
  "paragraph": 14,
  "char_start": 15,
  "char_end": 24,
  "xml": "<w:p>...</w:p>"
}
paragraph
int
0-based paragraph index within the document.
char_start
int
Start character offset within the paragraph text.
char_end
int
End character offset within the paragraph text.
xml
string
Raw OOXML of the matched paragraph element. Can be used to locate the exact element in the .docx archive.

DOCX Table Cell (docx_table_cell)

Cell reference within a DOCX table.
{
  "kind": "docx_table_cell",
  "table": 2,
  "row": 4,
  "column": 1,
  "char_start": 0,
  "char_end": 12,
  "xml": "<w:tc>...</w:tc>"
}
table
int
0-based table index.
row
int
0-based row index.
column
int
0-based column index.
char_start
int
Start character offset within the cell text.
char_end
int
End character offset within the cell text.
xml
string
Raw OOXML of the matched table cell element. Can be used to locate the exact element in the .docx archive.

Plain Text (text_span)

Line and character offsets for plain text files (.txt, .md, .json, etc.).
{
  "kind": "text_span",
  "line_start": 42,
  "line_end": 42,
  "char_start": 15,
  "char_end": 24
}
line_start
int
1-based start line number.
line_end
int
1-based end line number.
char_start
int
Start character offset within the line.
char_end
int
End character offset within the line.
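A text_span anchor can be turned back into the matched text by slicing the file's lines. The sketch below rests on assumptions the text above does not state: character offsets are treated as 0-based with an exclusive end, char_start is relative to line_start, and char_end is relative to line_end. Verify these against real responses before relying on it; extract_text_span is a hypothetical helper.

```python
def extract_text_span(text: str, anchor: dict) -> str:
    """Return the text covered by a text_span anchor (see assumptions above)."""
    lines = text.splitlines()
    start_line = anchor["line_start"] - 1  # line numbers are 1-based
    end_line = anchor["line_end"] - 1
    if start_line == end_line:
        return lines[start_line][anchor["char_start"]:anchor["char_end"]]
    # Multi-line span: tail of the first line, whole middle lines, head of the last.
    first = lines[start_line][anchor["char_start"]:]
    middle = lines[start_line + 1:end_line]
    last = lines[end_line][:anchor["char_end"]]
    return "\n".join([first, *middle, last])


example_text = "Invoice\nNumber: INV-1032\nTotal: 10"
example_span = extract_text_span(
    example_text,
    {"kind": "text_span", "line_start": 2, "line_end": 2, "char_start": 8, "char_end": 16},
)
# example_span == "INV-1032" under the assumptions above
```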

Supported Document Types

| Document Type | Anchor Kind | Description |
| --- | --- | --- |
| pdf | pdf_bbox | Bounding box on a PDF page |
| image | image_bbox | Bounding box on an image |
| xlsx | spreadsheet_cell | Cell reference in a spreadsheet |
| csv | csv_cell | Cell reference in a CSV file |
| docx | docx_text_span / docx_table_cell | Paragraph or table cell reference |
| txt | text_span | Line/character offsets in the text file |
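Code that handles more than one document type can branch on the kind field. The following sketch maps each anchor kind listed above to a short locator label for a citation tooltip; describe_anchor and the label wording are hypothetical, and unknown kinds fall through to a generic message rather than raising.

```python
def describe_anchor(anchor: dict) -> str:
    """Return a short human-readable locator for any anchor kind."""
    kind = anchor["kind"]
    if kind == "pdf_bbox":
        return f"page {anchor['page']}"
    if kind == "image_bbox":
        return "image region"
    if kind == "csv_cell":
        return f"cell {anchor['coordinate']}"
    if kind == "spreadsheet_cell":
        return f"sheet {anchor['sheet_name']}, cell {anchor['coordinate']}"
    if kind == "docx_text_span":
        return f"paragraph {anchor['paragraph']}, chars {anchor['char_start']}-{anchor['char_end']}"
    if kind == "docx_table_cell":
        return f"table {anchor['table']}, row {anchor['row']}, column {anchor['column']}"
    if kind == "text_span":
        return f"lines {anchor['line_start']}-{anchor['line_end']}"
    # Fall back gracefully if a kind not covered here shows up.
    return f"unknown anchor kind: {kind}"
```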

Edge Cases

  • Null values: Wrapped as {"value": null, "source": null}.
  • Empty arrays/objects: Preserved unchanged (not wrapped).
  • Unsourced fields: If a field cannot be located in the document, source is null. The endpoint still returns 200 with partial results.
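Because a 200 response can still contain unsourced fields, it can help to measure how much of an extraction was actually located. The sketch below counts leaves with and without a source while respecting the edge cases above; count_leaves is a hypothetical helper, and it assumes the sources payload is plain dicts and lists. Null values count toward the total but never toward the sourced count, since their source is null.

```python
from typing import Any


def count_leaves(node: Any) -> tuple[int, int]:
    """Return (sourced, total) leaf counts for a sourced tree."""
    if isinstance(node, dict):
        is_leaf = set(node) == {"value", "source"} and not isinstance(node["value"], (dict, list))
        if is_leaf:
            return (1 if node["source"] is not None else 0), 1
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return 0, 0
    sourced = total = 0
    # Empty arrays and objects have no children, so they contribute (0, 0).
    for child in children:
        child_sourced, child_total = count_leaves(child)
        sourced += child_sourced
        total += child_total
    return sourced, total


example_tree = {
    "invoice_number": {"value": "INV-1032", "source": {"content": "INV-1032", "anchor": {"kind": "pdf_bbox"}}},
    "po_number": {"value": None, "source": None},
    "tags": [],
    "line_items": [{"quantity": {"value": 2, "source": None}}],
}
example_counts = count_leaves(example_tree)
# example_counts == (1, 3)
```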

Use Case: Building a Review UI

Use sources to highlight extracted values in the original document for human review.
from retab import Retab

client = Retab()

# 1. Extract data from a document
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    model="retab-small",
    json_schema=my_schema,  # your extraction schema, defined elsewhere
)

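# 2. Get sources for the extraction
sourced = client.extractions.sources(result.extraction_id)

# 3. Use anchors to highlight values in a PDF viewer.
# This loop covers top-level fields only; nested objects and arrays
# (such as customer and line_items) need a recursive walk.
# Attribute access on the response matches the first example on this page.
for field, leaf in sourced.sources.items():
    # Skip nested containers and leaves whose source is null.
    if isinstance(leaf, dict) and "source" in leaf and leaf["source"]:
        anchor = leaf["source"]["anchor"]
        if anchor["kind"] == "pdf_bbox":
            print(f"{field}: page {anchor['page']}, "
                  f"bbox ({anchor['left']:.2f}, {anchor['top']:.2f}, "
                  f"{anchor['width']:.2f}, {anchor['height']:.2f})")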
Please check the API Reference for complete endpoint documentation.