Evals - Retab Docs

What are Workflow Evals?

Workflow evals are saved, block-level regression checks. Each eval freezes the inputs for one workflow block, replays that block against the current workflow draft, and evaluates one assertion against the block output. Use evals when you want to answer questions like:

Does this Extract block still return the expected invoice total?
Does this Function block still compute the validation flag correctly?
Does this Split block still assign pages to the right subdocuments?
Does this Classifier block still route a file to the expected category?

Evals are intentionally block-scoped. Running an eval executes the selected block with saved inputs instead of replaying the entire workflow, which keeps the feedback loop short while you adjust schemas, prompts, function code, categories, or split definitions.

Supported Blocks

Workflow evals are currently supported for:

Block	What you usually assert
Extract	Extracted JSON fields
Function	Returned JSON fields from the function output schema
Split	Split manifest quality or produced subdocument file handles
Classifier	Produced category file handles

Other block types, such as input blocks, notes, API calls, conditionals, review gates, and loops, do not currently produce workflow eval runs directly.

Eval Shape

A workflow eval stores:

Target - the block to run, currently { "type": "block", "block_id": "..." }.
Source - the handle inputs to replay.
Assertion - one expected condition for one declared output handle.

{
  "target": { "type": "block", "block_id": "block_extract_invoice" },
  "source": {
    "type": "manual",
    "handle_inputs": {
      "input-document-0": {
        "type": "file",
        "document": {
          "id": "file_invoice_q1",
          "filename": "invoice.pdf",
          "mime_type": "application/pdf"
        }
      }
    }
  },
  "assertion": {
    "target": { "output_handle_id": "output-json-0", "path": "total" },
    "condition": { "kind": "equals", "expected": 1234.56 }
  }
}

source.type can be:

Source	Fields	Use it when
`manual`	`handle_inputs`	You want to provide explicit JSON or file inputs.
`run_step`	`run_id`, optional `step_id`	You want to capture the inputs a block received in a real run.

For blocks executed inside a loop, provide step_id so Retab knows which iteration’s inputs to capture. File inputs are materialized as durable Retab file references so the eval does not depend on the original browser upload session.
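As a minimal sketch, the eval shape above can be assembled in Python. The helper names (`manual_source`, `run_step_source`, `build_eval`) and the placeholder IDs are hypothetical and exist only for illustration; the dictionary keys mirror the example payload and the source table.

```python
from typing import Any, Optional


def manual_source(handle_inputs: dict[str, Any]) -> dict[str, Any]:
    """Source with explicit JSON or file inputs, keyed by input handle id."""
    return {"type": "manual", "handle_inputs": handle_inputs}


def run_step_source(run_id: str, step_id: Optional[str] = None) -> dict[str, Any]:
    """Source that captures the inputs a block received in a real run.

    step_id is only included when given, e.g. for a block inside a loop.
    """
    source: dict[str, Any] = {"type": "run_step", "run_id": run_id}
    if step_id is not None:
        source["step_id"] = step_id
    return source


def build_eval(
    block_id: str,
    source: dict[str, Any],
    output_handle_id: str,
    condition: dict[str, Any],
    path: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble target, source, and a single assertion into one eval dict."""
    assertion_target: dict[str, Any] = {"output_handle_id": output_handle_id}
    if path is not None:
        assertion_target["path"] = path
    return {
        "target": {"type": "block", "block_id": block_id},
        "source": source,
        "assertion": {"target": assertion_target, "condition": condition},
    }


# Hypothetical IDs: an eval captured from one loop iteration of a real run.
loop_eval = build_eval(
    block_id="block_extract_invoice",
    source=run_step_source("run_123", step_id="step_iteration_2"),
    output_handle_id="output-json-0",
    condition={"kind": "equals", "expected": 1234.56},
    path="total",
)
```

Keeping the source builders separate from `build_eval` makes it easy to swap a manual source for a captured one without touching the assertion.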

Assertions

An assertion targets a declared output handle and, optionally, a dotted path inside that handle’s JSON payload.

Block	Output target examples
Extract	`output-json-0.total`, `output-json-0.vendor.name`
Function	`output-json-0.is_valid`, `output-json-0.error_message`
Split	`output-json-splits` or a subdocument file handle
Classifier	A category file handle
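To make the dotted-path idea concrete, here is a small sketch of how a path such as `vendor.name` could be resolved against a handle's JSON payload. It is an illustration only: `resolve_path` and `MISSING` are hypothetical names, only object keys are handled, and Retab's actual resolver may treat array indices or special characters differently.

```python
from typing import Any, Optional

# Sentinel so that a legitimately stored null is distinguishable from "not found".
MISSING = object()


def resolve_path(payload: Any, path: Optional[str] = None) -> Any:
    """Walk a dotted path through nested dicts; return MISSING if any key is absent."""
    if not path:
        # No path: the assertion targets the whole handle payload.
        return payload
    current = payload
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return MISSING
    return current


payload = {"total": 1234.56, "vendor": {"name": "Acme", "vat_id": None}}
vendor_name = resolve_path(payload, "vendor.name")      # "Acme"
missing_field = resolve_path(payload, "vendor.address")  # MISSING
```

The sentinel matters for presence checks: a field that exists with a null value should satisfy an `exists`-style check under this sketch, while a path that never resolves should not.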

The current condition kinds are:

Kind	Use for
`exists`, `not_exists`	Presence checks.
`equals`, `not_equals`	Strict value equality or inequality.
`contains`, `not_contains`	Substring or list membership checks.
`number_compare`, `between`	Numeric comparisons.
`starts_with`, `ends_with`, `matches_regex`	String pattern checks.
`object_contains`, `array_contains`	Subset checks for objects and arrays of objects.
`length_compare`	Length checks for strings, arrays, or objects.
`json_schema_valid`	JSON Schema validation for a target subtree.
`all_items_match`, `any_item_matches`	Nested assertions over array items.
`similarity_gte`	Similarity thresholds.
`llm_judged_as`, `llm_not_judged_as`	Rubric-based LLM judging.
`split_iou_gte`	Intersection-over-Union for split page assignments.
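For readers unfamiliar with Intersection-over-Union, the sketch below computes it for two sets of page numbers. This shows the general metric only; how Retab aggregates IoU across subdocuments, and how the threshold is expressed in the condition, is not described here. The function name `page_iou` is hypothetical.

```python
from typing import Iterable


def page_iou(expected_pages: Iterable[int], actual_pages: Iterable[int]) -> float:
    """Intersection-over-Union of two page sets: |A & B| / |A | B|."""
    expected, actual = set(expected_pages), set(actual_pages)
    union = expected | actual
    if not union:
        # Both sets empty: treated as perfect agreement (a convention chosen for this sketch).
        return 1.0
    return len(expected & actual) / len(union)


# Expected pages 1-3, block assigned pages 2-4: 2 shared pages out of 4 distinct pages.
score = page_iou([1, 2, 3], [2, 3, 4])  # 0.5
```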

number_compare and length_compare use op values gt, gte, lt, lte, eq, or neq.
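The six op values map naturally onto standard comparison operators. The sketch below shows one way to interpret them; it is not Retab's implementation, and the name of the condition field that carries the reference value is not stated in this page, so `reference` here is just a function parameter.

```python
import operator
from typing import Any, Callable, Sized

# Op values from the documentation mapped to Python's comparison functions.
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
    "neq": operator.ne,
}


def compare(op: str, actual: Any, reference: Any) -> bool:
    """Apply one op value to an actual value and a reference value."""
    if op not in OPS:
        raise ValueError(f"unsupported op: {op!r}")
    return OPS[op](actual, reference)


def number_compare(op: str, actual: float, reference: float) -> bool:
    """Numeric comparison, e.g. total >= 1000."""
    return compare(op, actual, reference)


def length_compare(op: str, actual: Sized, reference: int) -> bool:
    """Length comparison for strings, arrays, or objects (anything with len())."""
    return compare(op, len(actual), reference)


total_ok = number_compare("gte", 1234.56, 1000)              # True
few_items = length_compare("lte", ["a", "b", "c"], 2)         # False
```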

Running Evals

You can run one eval, every eval for one block, or every eval in a workflow. In the API, create a parent eval run with:

POST /v1/workflows/evals/runs

The request body always includes workflow_id; the scope field is optional:

Scope	Meaning
omitted or `null`	Run every saved eval in the workflow.
`{ "type": "workflow" }`	Run every saved eval in the workflow.
`{ "type": "block" }`	Run every saved eval for one block.
`{ "type": "single" }`	Run one saved eval by `eval_id`.
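The three scopes could be expressed as request bodies along these lines. Treat this as a sketch under assumptions: the helper names are hypothetical, placing `eval_id` inside the scope object is an assumption, and the `block_id` field for block scope is a guess based on the eval target shape rather than something this page states.

```python
from typing import Any


def run_workflow_body(workflow_id: str) -> dict[str, Any]:
    """Omit scope entirely: run every saved eval in the workflow."""
    return {"workflow_id": workflow_id}


def run_block_body(workflow_id: str, block_id: str) -> dict[str, Any]:
    """Run every saved eval for one block.

    ASSUMPTION: the block is identified by a `block_id` field inside scope.
    """
    return {
        "workflow_id": workflow_id,
        "scope": {"type": "block", "block_id": block_id},
    }


def run_single_body(workflow_id: str, eval_id: str) -> dict[str, Any]:
    """Run one saved eval by eval_id.

    ASSUMPTION: eval_id sits inside the scope object.
    """
    return {
        "workflow_id": workflow_id,
        "scope": {"type": "single", "eval_id": eval_id},
    }


# Hypothetical IDs, for illustration only.
body = run_single_body("wf_invoices", "eval_total_matches")
```

Any of these dictionaries would be sent as the JSON body of `POST /v1/workflows/evals/runs`; the base URL and authentication are left out because they are not covered here.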

When an eval run starts, Retab:

1. Loads the current workflow draft and current block configuration.
2. Rebuilds the saved handle inputs into normal runtime inputs.
3. Executes only the selected block.
4. Stores the block artifact, handle outputs, routing decisions, warnings, and timing.
5. Resolves the assertion target from the handle outputs.
6. Records the assertion outcome and the result verdict.

Eval runs are asynchronous. Poll GET /v1/workflows/evals/runs/{run_id} until the parent run reaches a terminal lifecycle, then read child rows from GET /v1/workflows/evals/results?run_id={run_id}.
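A polling loop for the parent run might look like the sketch below. To avoid guessing at the base URL or authentication, the HTTP call is injected as a callable (`fetch_run`, a hypothetical name) that should wrap `GET /v1/workflows/evals/runs/{run_id}` and return the parsed JSON. Two assumptions are baked in: the response exposes its state under a `lifecycle` key, and the terminal states are `completed`, `error`, and `cancelled`.

```python
import time
from typing import Any, Callable

# ASSUMPTION: these lifecycle values are the terminal ones.
TERMINAL_LIFECYCLES = {"completed", "error", "cancelled"}


def wait_for_run(
    fetch_run: Callable[[], dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the parent run reaches a terminal lifecycle or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run()
        # ASSUMPTION: the run object carries its state in a "lifecycle" field.
        if run.get("lifecycle") in TERMINAL_LIFECYCLES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("eval run did not reach a terminal lifecycle in time")
        sleep(interval_s)


# Simulated responses standing in for real HTTP calls.
_responses = iter(
    [{"lifecycle": "queued"}, {"lifecycle": "running"}, {"lifecycle": "completed"}]
)
final_run = wait_for_run(lambda: next(_responses), interval_s=0, sleep=lambda _s: None)
```

Once the loop returns, the child rows can be read from `GET /v1/workflows/evals/results?run_id={run_id}`.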

Results

A parent WorkflowEvalRun has a lifecycle:

Lifecycle	Meaning
`pending`	The run was created but execution has not started.
`queued`	The run is waiting for a worker.
`running`	One or more eval results are executing.
`completed`	The run finished; inspect outcome counts or rows.
`error`	The run failed before normal completion.
`cancelled`	The run was cancelled.

The parent run's counts separate lifecycle from assertion outcomes:

{
  "lifecycle_counts": {
    "pending": 0,
    "queued": 0,
    "running": 0,
    "completed": 1,
    "error": 0,
    "cancelled": 0
  },
  "outcome": {
    "passed": 1,
    "failed": 0,
    "blocked": 0
  }
}

Each WorkflowEvalResult row has its own lifecycle and, once completed, a verdict of passed, failed, or blocked. The nested assertion_result.outcome uses the same three outcome values and includes actual_value, expected_value, optional score/threshold fields, and failure details when the assertion cannot pass. Result rows also include the saved handle_inputs, produced handle_outputs, workflow/block fingerprints, the execution artifact, routing decisions, warnings, and timing.
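Because lifecycle and outcome are reported separately, a gate such as a CI check has to look at both. The sketch below shows one possible policy over the counts object shown above; `run_is_green` is a hypothetical helper, and treating `blocked` as a failure and requiring at least one pass are choices made for illustration, not rules stated by Retab.

```python
from typing import Any


def run_is_green(counts: dict[str, Any]) -> bool:
    """True only if every result finished and every assertion passed."""
    lifecycle = counts.get("lifecycle_counts", {})
    outcome = counts.get("outcome", {})

    # Results that have not finished yet.
    in_flight = sum(lifecycle.get(k, 0) for k in ("pending", "queued", "running"))
    # Results that ended without normal completion.
    abnormal = sum(lifecycle.get(k, 0) for k in ("error", "cancelled"))
    # Policy choice: blocked counts against the run, same as failed.
    not_passed = outcome.get("failed", 0) + outcome.get("blocked", 0)

    return (
        in_flight == 0
        and abnormal == 0
        and not_passed == 0
        and outcome.get("passed", 0) > 0
    )


counts = {
    "lifecycle_counts": {
        "pending": 0, "queued": 0, "running": 0,
        "completed": 1, "error": 0, "cancelled": 0,
    },
    "outcome": {"passed": 1, "failed": 0, "blocked": 0},
}
green = run_is_green(counts)  # True
```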

Freshness and Drift

Workflow evals are tied to the block inputs and output schema that existed when the eval was created or last updated. When the workflow draft changes, Retab reports several freshness signals:

Field	Values	Meaning
`schema_drift`	`none`, `partial`, `drifted`, `unknown`	Whether the assertion target still resolves.
`assertion_drift_status`	`valid`, `drifted`, `broken`	Whether the saved assertion is still usable.
`freshness.status`	`fresh`, `stale`, `unknown`	Whether the latest run matches the baseline.
`drift.status`	`none`, `drifted`, `broken`, `unknown`	Artifact-level drift summary.

The eval also stores latest_run_summary, latest_passing_run_summary, and latest_failing_run_summary. Each summary separates run lifecycle (status) from assertion outcome (outcome). Staleness does not automatically mean the workflow is broken. It means the eval should be rerun or recaptured before you rely on its latest result.
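One way to turn these signals into a review queue is sketched below. The record layout is an assumption: `freshness.status` is read as a nested `freshness` object with a `status` key, and `assertion_drift_status` is read as a top-level field on the eval. The helper name `review_reasons` is hypothetical.

```python
from typing import Any


def review_reasons(eval_record: dict[str, Any]) -> list[str]:
    """Return the reasons an eval should be rerun or recaptured; empty if none."""
    reasons: list[str] = []

    # ASSUMPTION: freshness is a nested object; a missing value is treated as "unknown".
    freshness = (eval_record.get("freshness") or {}).get("status", "unknown")
    if freshness != "fresh":
        reasons.append(f"freshness={freshness}")

    # ASSUMPTION: assertion_drift_status is a top-level field on the eval.
    assertion_status = eval_record.get("assertion_drift_status")
    if assertion_status in {"drifted", "broken"}:
        reasons.append(f"assertion_drift_status={assertion_status}")

    return reasons


# Hypothetical eval records.
evals = [
    {"id": "eval_a", "freshness": {"status": "fresh"}, "assertion_drift_status": "valid"},
    {"id": "eval_b", "freshness": {"status": "stale"}, "assertion_drift_status": "valid"},
    {"id": "eval_c", "freshness": {"status": "stale"}, "assertion_drift_status": "broken"},
]
review_queue = {e["id"]: review_reasons(e) for e in evals if review_reasons(e)}
```

A stale entry in the queue is a prompt to rerun, while a broken assertion suggests the eval needs to be recaptured or edited before its result means anything.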

Recommended Workflow

1. Run the workflow on representative documents.
2. Open the Evals page and create an eval from a completed run, or create one with explicit manual inputs.
3. Pick the block output field or handle you want to protect.
4. Define one assertion for the expected behavior.
5. Run the eval after changing schemas, prompts, code, categories, or split definitions.
6. Use stale evals as a review queue before publishing workflow changes.

Evals work best when each one protects one behavior. Prefer multiple focused evals over one broad assertion so failures point directly to the changed output.

API Reference

Action	Endpoint
Create/list	`/v1/workflows/evals`
Get/update/delete	`/v1/workflows/evals/{eval_id}`
Run evals	`/v1/workflows/evals/runs`
Read results	`/v1/workflows/evals/results`
