What are Workflow Evals?

Workflow evals are saved, block-level regression checks. Each eval freezes the inputs for one workflow block, replays that block against the current workflow draft, and evaluates one assertion against the block output. Use evals when you want to answer questions like:
  • Does this Extract block still return the expected invoice total?
  • Does this Function block still compute the validation flag correctly?
  • Does this Split block still assign pages to the right subdocuments?
  • Does this Classifier block still route a file to the expected category?
Evals are intentionally block-scoped. Running an eval executes the selected block with saved inputs instead of replaying the entire workflow, which keeps the feedback loop short while you adjust schemas, prompts, function code, categories, or split definitions.

Supported Blocks

Workflow evals are currently supported for:

| Block | What you usually assert |
| --- | --- |
| Extract | Extracted JSON fields |
| Function | Returned JSON fields from the function output schema |
| Split | Split manifest quality or produced subdocument file handles |
| Classifier | Produced category file handles |

Other block types, such as input blocks, notes, API calls, conditionals, review gates, and loops, do not currently produce workflow eval runs directly.

Eval Shape

A workflow eval stores:
  1. Target - the block to run, currently { "type": "block", "block_id": "..." }.
  2. Source - the handle inputs to replay.
  3. Assertion - one expected condition for one declared output handle.
{
  "target": { "type": "block", "block_id": "block_extract_invoice" },
  "source": {
    "type": "manual",
    "handle_inputs": {
      "input-document-0": {
        "type": "file",
        "document": {
          "id": "file_invoice_q1",
          "filename": "invoice.pdf",
          "mime_type": "application/pdf"
        }
      }
    }
  },
  "assertion": {
    "target": { "output_handle_id": "output-json-0", "path": "total" },
    "condition": { "kind": "equals", "expected": 1234.56 }
  }
}
source.type can be:

| Source | Fields | Use it when |
| --- | --- | --- |
| manual | handle_inputs | You want to provide explicit JSON or file inputs. |
| run_step | run_id, optional step_id | You want to capture the inputs a block received in a real run. |

For blocks executed inside a loop, provide step_id so Retab knows which iteration’s inputs to capture. File inputs are materialized as durable Retab file references so the eval does not depend on the original browser upload session.
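
As a minimal sketch, the two source variants and the overall eval shape can be assembled with a few small helpers. The function names `manual_source`, `run_step_source`, and `build_eval` are hypothetical and not part of any Retab SDK, and the IDs `run_123` and `step_7` are placeholders; the field names are the ones shown in the JSON example above.

```python
from typing import Any, Optional


def manual_source(handle_inputs: dict[str, Any]) -> dict[str, Any]:
    """Source with explicit JSON or file inputs, keyed by input handle id."""
    return {"type": "manual", "handle_inputs": handle_inputs}


def run_step_source(run_id: str, step_id: Optional[str] = None) -> dict[str, Any]:
    """Source that captures the inputs a block received in a real run."""
    source: dict[str, Any] = {"type": "run_step", "run_id": run_id}
    if step_id is not None:
        # Needed for blocks executed inside a loop, to pick one iteration.
        source["step_id"] = step_id
    return source


def build_eval(
    block_id: str,
    source: dict[str, Any],
    output_handle_id: str,
    condition: dict[str, Any],
    path: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble target, source, and a single assertion into one eval body."""
    assertion_target: dict[str, Any] = {"output_handle_id": output_handle_id}
    if path is not None:
        assertion_target["path"] = path
    return {
        "target": {"type": "block", "block_id": block_id},
        "source": source,
        "assertion": {"target": assertion_target, "condition": condition},
    }


# Placeholder IDs; substitute values from a real workflow run.
example_eval = build_eval(
    block_id="block_extract_invoice",
    source=run_step_source("run_123", step_id="step_7"),
    output_handle_id="output-json-0",
    condition={"kind": "equals", "expected": 1234.56},
    path="total",
)
```

Keeping the source builders separate from `build_eval` makes it easy to switch an eval from captured run inputs to explicit manual inputs without touching the assertion.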

Assertions

An assertion targets a declared output handle and, optionally, a dotted path inside that handle’s JSON payload.

| Block | Output target examples |
| --- | --- |
| Extract | output-json-0.total, output-json-0.vendor.name |
| Function | output-json-0.is_valid, output-json-0.error_message |
| Split | output-json-splits or a subdocument file handle |
| Classifier | A category file handle |

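
To make the handle-plus-path addressing concrete, here is a rough local sketch of how a dotted path could be resolved against a handle's JSON payload. It is an illustration, not Retab's resolver: `resolve_path` and `MISSING` are hypothetical names, and the sketch only walks object keys (whether paths can index into arrays is not stated here).

```python
from typing import Any, Optional

# Sentinel so that a stored null/None value is distinguishable from "absent".
MISSING = object()


def resolve_path(payload: Any, path: Optional[str] = None) -> Any:
    """Walk a dotted path through nested dicts; return MISSING if any key is absent."""
    if not path:
        return payload  # no path: the assertion targets the whole handle payload
    current = payload
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return MISSING
    return current


# Sample handle outputs shaped like the Extract examples in the table above.
handle_outputs = {
    "output-json-0": {"total": 1234.56, "vendor": {"name": "Acme"}},
}

total = resolve_path(handle_outputs["output-json-0"], "total")
vendor_name = resolve_path(handle_outputs["output-json-0"], "vendor.name")
absent = resolve_path(handle_outputs["output-json-0"], "vendor.tax_id")
```

A sentinel rather than `None` matters for presence checks such as exists and not_exists, where a field that is present but null should not look the same as a field that is missing.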
The current condition kinds are:

| Kind | Use for |
| --- | --- |
| exists, not_exists | Presence checks. |
| equals, not_equals | Strict value equality or inequality. |
| contains, not_contains | Substring or list membership checks. |
| number_compare, between | Numeric comparisons. |
| starts_with, ends_with, matches_regex | String pattern checks. |
| object_contains, array_contains | Subset checks for objects and arrays of objects. |
| length_compare | Length checks for strings, arrays, or objects. |
| json_schema_valid | JSON Schema validation for a target subtree. |
| all_items_match, any_item_matches | Nested assertions over array items. |
| similarity_gte | Similarity thresholds. |
| llm_judged_as, llm_not_judged_as | Rubric-based LLM judging. |
| split_iou_gte | Intersection-over-Union for split page assignments. |

number_compare and length_compare use op values gt, gte, lt, lte, eq, or neq.
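
The sketch below illustrates the semantics of the six op values and of Intersection-over-Union on page sets. It is a local approximation for intuition only: the function names and the `threshold` parameter are hypothetical, and how Retab aggregates IoU across several subdocuments is not specified here, so `page_iou` scores a single subdocument.

```python
import operator
from typing import Any, Callable, Iterable, Sized

# The six op values accepted by number_compare and length_compare.
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
    "neq": operator.ne,
}


def number_compare(actual: float, op: str, threshold: float) -> bool:
    """Compare a numeric output value against a threshold."""
    return OPS[op](actual, threshold)


def length_compare(actual: Sized, op: str, threshold: int) -> bool:
    """Compare the length of a string, array, or object against a threshold."""
    return OPS[op](len(actual), threshold)


def page_iou(expected_pages: Iterable[int], actual_pages: Iterable[int]) -> float:
    """Intersection-over-Union of two page sets for one subdocument."""
    expected, actual = set(expected_pages), set(actual_pages)
    union = expected | actual
    if not union:
        return 1.0  # assumption: two empty assignments count as a perfect match
    return len(expected & actual) / len(union)


# Expected pages 1-3, produced pages 2-4: two shared pages out of four distinct.
iou = page_iou([1, 2, 3], [2, 3, 4])  # 0.5
```

An IoU threshold tolerates small boundary errors in a split (one page assigned to the neighbouring subdocument) while still failing when the assignment is mostly wrong, which strict equality on the manifest would not allow.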

Running Evals

You can run one eval, every eval for one block, or every eval in a workflow. In the API, create a parent eval run with:
POST /v1/workflows/evals/runs
The request body always includes workflow_id. scope is optional:

| Scope | Meaning |
| --- | --- |
| omitted or null | Run every saved eval in the workflow. |
| { "type": "workflow" } | Run every saved eval in the workflow. |
| { "type": "block" } | Run every saved eval for one block. |
| { "type": "single" } | Run one saved eval by eval_id. |

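
A small helper can build the request body for each scope. This is a sketch under assumptions: `build_run_request` is a hypothetical name, and placing `block_id` and `eval_id` inside the scope object is a guess based on the table above, so check the API reference for the exact field layout.

```python
from typing import Any, Optional


def build_run_request(
    workflow_id: str,
    block_id: Optional[str] = None,
    eval_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build a body for POST /v1/workflows/evals/runs.

    With neither block_id nor eval_id, scope is omitted, which runs every
    saved eval in the workflow.
    """
    if block_id is not None and eval_id is not None:
        raise ValueError("pass block_id or eval_id, not both")

    body: dict[str, Any] = {"workflow_id": workflow_id}
    if eval_id is not None:
        # Assumed layout: the eval id travels inside the scope object.
        body["scope"] = {"type": "single", "eval_id": eval_id}
    elif block_id is not None:
        # Assumed layout: the block id travels inside the scope object.
        body["scope"] = {"type": "block", "block_id": block_id}
    return body


# Placeholder IDs for illustration.
whole_workflow = build_run_request("wf_123")
one_block = build_run_request("wf_123", block_id="block_extract_invoice")
one_eval = build_run_request("wf_123", eval_id="eval_456")
```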
When an eval run starts, Retab:
  1. Loads the current workflow draft and current block configuration.
  2. Rebuilds the saved handle inputs into normal runtime inputs.
  3. Executes only the selected block.
  4. Stores the block artifact, handle outputs, routing decisions, warnings, and timing.
  5. Resolves the assertion target from the handle outputs.
  6. Records the assertion outcome and the result verdict.
Eval runs are asynchronous. Poll GET /v1/workflows/evals/runs/{run_id} until the parent run reaches a terminal lifecycle, then read child rows from GET /v1/workflows/evals/results?run_id={run_id}.
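
The polling step can be sketched as a loop around an injected fetch function, which keeps the example independent of any HTTP client. Several things here are assumptions: the response is read as a dict with a `lifecycle` key, completed, error, and cancelled are treated as the terminal lifecycles, and `wait_for_run` and `fake_fetch` are hypothetical names. In real use, `fetch_run` would issue GET /v1/workflows/evals/runs/{run_id} with your credentials.

```python
import time
from typing import Any, Callable

# Assumed terminal lifecycles for a parent eval run.
TERMINAL_LIFECYCLES = {"completed", "error", "cancelled"}


def wait_for_run(
    fetch_run: Callable[[str], dict[str, Any]],
    run_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll fetch_run(run_id) until the run is terminal or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run(run_id)
        if run.get("lifecycle") in TERMINAL_LIFECYCLES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"eval run {run_id} did not finish in {timeout_s}s")
        sleep(interval_s)


# Demo with a fake transport that walks through three lifecycles.
_states = iter(["queued", "running", "completed"])


def fake_fetch(run_id: str) -> dict[str, Any]:
    return {"id": run_id, "lifecycle": next(_states)}


final_run = wait_for_run(fake_fetch, "run_123", sleep=lambda _seconds: None)
```

Once the parent run is terminal, the child rows can be read from GET /v1/workflows/evals/results?run_id={run_id}.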

Results

A parent WorkflowEvalRun has a lifecycle:

| Lifecycle | Meaning |
| --- | --- |
| pending | The run was created but execution has not started. |
| queued | The run is waiting for a worker. |
| running | One or more eval results are executing. |
| completed | The run finished; inspect outcome counts or rows. |
| error | The run failed before normal completion. |
| cancelled | The run was cancelled. |

The parent run's counts separate lifecycle from assertion outcomes:
{
  "lifecycle_counts": {
    "pending": 0,
    "queued": 0,
    "running": 0,
    "completed": 1,
    "error": 0,
    "cancelled": 0
  },
  "outcome": {
    "passed": 1,
    "failed": 0,
    "blocked": 0
  }
}
Each WorkflowEvalResult row has its own lifecycle and, once completed, a verdict of passed, failed, or blocked. The nested assertion_result.outcome uses the same three outcome values and includes actual_value, expected_value, optional score/threshold fields, and failure details when the assertion cannot pass. Result rows also include the saved handle_inputs, produced handle_outputs, workflow/block fingerprints, the execution artifact, routing decisions, warnings, and timing.
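
Because lifecycle and outcome are reported separately, a CI gate has to check both. The following sketch uses the counts shape shown above; `run_is_green` is a hypothetical helper, and treating blocked results as not green is a policy choice rather than documented behaviour.

```python
from typing import Any


def run_is_green(counts: dict[str, Any]) -> bool:
    """True only if every result finished normally and every assertion passed."""
    lifecycle = counts["lifecycle_counts"]
    outcome = counts["outcome"]

    unfinished = sum(lifecycle.get(k, 0) for k in ("pending", "queued", "running"))
    abnormal = lifecycle.get("error", 0) + lifecycle.get("cancelled", 0)
    not_passed = outcome.get("failed", 0) + outcome.get("blocked", 0)

    return unfinished == 0 and abnormal == 0 and not_passed == 0


# The counts object from the example above.
example_counts = {
    "lifecycle_counts": {
        "pending": 0, "queued": 0, "running": 0,
        "completed": 1, "error": 0, "cancelled": 0,
    },
    "outcome": {"passed": 1, "failed": 0, "blocked": 0},
}

green = run_is_green(example_counts)  # True
```

A result that errored contributes to lifecycle_counts without adding a failed outcome, so checking only `outcome.failed` could let a broken run slip through.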

Freshness and Drift

Workflow evals are tied to the block inputs and output schema that existed when the eval was created or last updated. When the workflow draft changes, Retab reports several freshness signals:

| Field | Values | Meaning |
| --- | --- | --- |
| schema_drift | none, partial, drifted, unknown | Whether the assertion target still resolves. |
| assertion_drift_status | valid, drifted, broken | Whether the saved assertion is still usable. |
| freshness.status | fresh, stale, unknown | Whether the latest run matches the baseline. |
| drift.status | none, drifted, broken, unknown | Artifact-level drift summary. |

The eval also stores latest_run_summary, latest_passing_run_summary, and latest_failing_run_summary. Each summary separates run lifecycle (status) from assertion outcome (outcome). Staleness does not automatically mean the workflow is broken. It means the eval should be rerun or recaptured before you rely on its latest result.
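
One way to turn these signals into a review queue is a small triage function. This is a sketch, not Retab behaviour: `triage` and its return labels are hypothetical, and it assumes `freshness.status` is exposed as a nested `freshness` object on the eval record.

```python
from typing import Any


def triage(eval_record: dict[str, Any]) -> str:
    """Suggest a next action for one eval based on its freshness signals."""
    assertion_status = eval_record.get("assertion_drift_status", "valid")
    schema_drift = eval_record.get("schema_drift", "unknown")
    freshness = eval_record.get("freshness", {}).get("status", "unknown")

    if assertion_status == "broken":
        return "fix_assertion"  # the saved assertion is no longer usable
    if assertion_status == "drifted" or schema_drift in ("partial", "drifted"):
        return "review_assertion"  # the target may have moved or changed
    if freshness != "fresh":
        return "rerun"  # stale or unknown: the latest result is not reliable
    return "ok"


stale_eval = {
    "assertion_drift_status": "valid",
    "schema_drift": "none",
    "freshness": {"status": "stale"},
}
action = triage(stale_eval)  # "rerun"
```

The order of the checks reflects the point above: a stale eval only needs a rerun, while a broken assertion has to be edited before any rerun is meaningful.
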
A typical loop for creating and maintaining evals:
  1. Run the workflow on representative documents.
  2. Open the Evals page and create an eval from a completed run, or create one with explicit manual inputs.
  3. Pick the block output field or handle you want to protect.
  4. Define one assertion for the expected behavior.
  5. Run the eval after changing schemas, prompts, code, categories, or split definitions.
  6. Use stale evals as a review queue before publishing workflow changes.
Evals work best when each one protects one behavior. Prefer multiple focused evals over one broad assertion so failures point directly to the changed output.

API Reference