What are Experiments?
Experiments are controlled evaluations of individual workflow blocks. They run
one block several times (the consensus passes) over a fixed set of documents
and use agreement between those passes as the quality signal.
Use experiments when you want to answer questions like:
- Did this schema change make extraction fields more stable?
- Which invoice documents are causing low agreement?
- Which split category or classifier category is ambiguous?
- Did a prompt, category, or split-definition change improve the block?
Experiments do not require ground-truth labels. They are consensus evals: Retab
asks the block to produce several independent candidate outputs, compares those
candidate outputs, and reports where the candidates agree or disagree. A higher
score means stronger agreement. A low score points to an unstable document,
field, category, subdocument type, or split-by-key partition.
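To make the idea of a consensus score concrete, here is a minimal sketch of one way to score agreement among candidate outputs: the share of candidates that match the most common value. This is an illustrative assumption, not Retab's scoring formula, which this page does not specify. The function name `agreement_score` and the sample values are hypothetical.

```python
from collections import Counter
import json


def agreement_score(candidates):
    """Return (consensus_value, score) for one set of candidate outputs.

    The score is the share of candidates equal to the most common value.
    Illustrative only: this is not Retab's documented scoring formula.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # Serialize each candidate so unhashable values (lists, dicts) can be counted.
    keys = [json.dumps(c, sort_keys=True) for c in candidates]
    winner, votes = Counter(keys).most_common(1)[0]
    return json.loads(winner), votes / len(keys)


# Five passes over the same invoice field: four agree, one dissents.
value, score = agreement_score(
    ["1200.00", "1200.00", "1200.00", "1200.00", "120.00"]
)
print(value, score)
```

Under this measure, a field on which every pass agrees scores 1.0, and each dissenting pass pulls the score down, which is the behavior described above: low scores flag unstable outputs without needing a labeled answer.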
Experiments vs Tests
Tests and experiments both help you keep workflow changes under control, but
they answer different questions.
| Tool | Best for | Signal |
|---|---|---|
| Tests | Checking a specific expected output | Pass, fail, or error against an assertion |
| Experiments | Measuring output stability across documents | Consensus score and disagreement details |
Use a test when you know what the output should be. Use an experiment when you
want to find weak spots, compare block configurations, or inspect whether a
block is internally consistent before you write stricter assertions.
Supported Blocks
Experiments are currently supported for:
| Block | What Retab measures |
|---|---|
| Extract | Field-level agreement for extracted JSON values |
| Split | Agreement on subdocument/page assignments |
| Classifier | Agreement on routing category decisions |
| For Each | Key-level agreement when the block is configured as split-by-key |
Other workflow blocks can still be tested with workflow tests, but they do not
currently produce experiment metrics.
How Experiments Work
An experiment is attached to one block in one workflow. It stores:
- The block under test: the workflow block ID and block kind.
- A fixed document set: materialized block inputs captured from completed
workflow runs or from files uploaded while creating the experiment.
- A consensus count: 3, 5, or 7 independent passes.
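The three stored pieces can be pictured as a small record with a few validity checks. The sketch below is only a mental model: the class name `ExperimentSpec`, its field names, and the block-kind strings are assumptions made for illustration and are not Retab's data model.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the supported block kinds.
SUPPORTED_KINDS = {"extract", "split", "classifier", "for_each"}
# Consensus counts listed on this page.
ALLOWED_CONSENSUS = (3, 5, 7)


@dataclass(frozen=True)
class ExperimentSpec:
    """Illustrative record of what an experiment stores."""

    block_id: str
    block_kind: str
    document_ids: tuple
    n_consensus: int

    def __post_init__(self):
        if self.block_kind not in SUPPORTED_KINDS:
            raise ValueError(f"unsupported block kind: {self.block_kind}")
        if self.n_consensus not in ALLOWED_CONSENSUS:
            raise ValueError("n_consensus must be 3, 5, or 7")
        if not self.document_ids:
            raise ValueError("an experiment needs at least one document")


spec = ExperimentSpec(
    block_id="extract-invoice",
    block_kind="extract",
    document_ids=("doc_1", "doc_2"),
    n_consensus=5,
)
print(spec)
```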
When you run the experiment, Retab freezes the current block configuration and
replays the selected block for each document. Each document execution becomes an
experiment job. The job stores the canonical artifact produced by the block:
| Block | Artifact |
|---|---|
| Extract | Extraction |
| Split | Split |
| Classifier | Classification |
| For Each split-by-key | Partition |
Metrics are normalized to the same shape for every supported block:
document × target × voter
The target depends on the block type:
| Block | Target |
|---|---|
| Extract | Field |
| Split | Subdocument |
| Classifier | Category |
| For Each split-by-key | Key |
This lets the dashboard show the same core views for different block types:
overall summary, by-document scores, by-target scores, and voter-level
disagreement.
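The shared shape is what makes the by-document and by-target views possible for every block type. As a rough sketch, assume each document-target cell already has an agreement score derived from its voters; the two views are then averages along different axes of the same grid. The data, the `rollup` helper, and the use of a plain mean are assumptions for illustration, not Retab's aggregation method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-cell agreement scores keyed by (document, target).
# In practice each score would be derived from the voters for that cell.
cells = {
    ("invoice_a.pdf", "total"): 1.0,
    ("invoice_a.pdf", "due_date"): 0.6,
    ("invoice_b.pdf", "total"): 0.8,
    ("invoice_b.pdf", "due_date"): 0.4,
}


def rollup(cells, axis):
    """Average cell scores along one axis: 0 = by document, 1 = by target."""
    groups = defaultdict(list)
    for key, score in cells.items():
        groups[key[axis]].append(score)
    return {name: mean(scores) for name, scores in groups.items()}


by_document = rollup(cells, axis=0)
by_target = rollup(cells, axis=1)
overall = mean(cells.values())

# Sort ascending so the weakest documents and targets come first.
print(sorted(by_document.items(), key=lambda item: item[1]))
print(sorted(by_target.items(), key=lambda item: item[1]))
print(overall)
```

Because only the meaning of "target" changes between block types, the same rollup applies whether the targets are fields, subdocuments, categories, or keys.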
Creating an Experiment
- Open a workflow in the dashboard.
- Go to Console -> Experiments.
- Click Experiment.
- Name the experiment.
- Choose the number of consensus passes: 3, 5, or 7.
- Select the block to evaluate.
- Select files from completed runs, or upload files for the experiment.
- Create the experiment.
When you create an experiment from the dashboard, Retab immediately starts
computing metrics. The experiment page updates while the run is pending or
running.
Reading Results
The experiment detail page has three sections:
| Section | Purpose |
|---|---|
| Config | Review or edit the block configuration being evaluated. |
| Data | Inspect per-document outputs and the underlying artifacts. |
| Metrics | Analyze consensus scores, weak targets, document-level failures, and voter disagreements. |
The metrics views help you move from broad signal to specific evidence:
- Summary shows the overall experiment score, target averages, document
averages, and previous-run delta when available.
- By document shows which files are least stable.
- By target shows which fields, categories, subdocuments, or keys are least
stable across the document set.
- Votes shows the individual candidate outputs for one document-target
cell, including the consensus value and disagreements.
Split and classifier experiments also expose specialized visualizations, such as
confusion-style views, to make routing and page-assignment ambiguity easier to
inspect.
Staleness and Re-runs
Experiment metrics belong to a specific block configuration and document set. If
you edit the block or change the experiment documents, Retab marks the latest
metrics as stale. If the output schema changes, Retab can also report schema
drift.
When an experiment is stale, run it again to refresh the score against the
current workflow draft. Retab keeps run history, so you can compare the latest
score with earlier runs and see whether a configuration change improved or
degraded the block.
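One common way to implement this kind of staleness check is to fingerprint the inputs a run depended on and compare that fingerprint with the current state. The sketch below illustrates the idea only; this page does not describe how Retab detects staleness, and the `fingerprint` and `is_stale` helpers are hypothetical.

```python
import hashlib
import json


def fingerprint(block_config, document_ids):
    """Hash the block config and document set into one stable digest."""
    payload = json.dumps(
        {"config": block_config, "documents": sorted(document_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_stale(run_fingerprint, block_config, document_ids):
    """Metrics are stale when the current state no longer matches the run."""
    return run_fingerprint != fingerprint(block_config, document_ids)


config = {"prompt": "Extract invoice fields", "schema": {"total": "number"}}
docs = ["doc_1", "doc_2"]
at_run_time = fingerprint(config, docs)

# Nothing changed since the run.
print(is_stale(at_run_time, config, docs))
# The block was edited after the run.
edited = {**config, "prompt": "Extract invoice fields precisely"}
print(is_stale(at_run_time, edited, docs))
# A document was added after the run.
print(is_stale(at_run_time, config, docs + ["doc_3"]))
```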
Recommended Workflow
- Run the workflow with representative documents.
- Create an experiment for an Extract, Split, Classifier, or split-by-key For
Each block.
- Start with 3 consensus passes while iterating quickly.
- Inspect the lowest-scoring documents and targets.
- Adjust the schema, prompt, categories, or split definitions.
- Re-run the experiment and compare the score with the previous run.
- Add workflow tests for outputs that should now be protected with explicit
assertions.
Experiments work best as a discovery and comparison tool. They tell you where a
block is uncertain; tests then lock in the behaviors you decide are correct.
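The compare step in this loop can be sketched as a diff between two sets of per-target scores. The example assumes you have already collected a target-to-score mapping for each run; the `compare_runs` helper, the tolerance, and the sample numbers are hypothetical and do not reflect the shape of any Retab response.

```python
def compare_runs(previous, latest, tolerance=0.01):
    """Classify each target as improved, degraded, or unchanged between runs."""
    report = {}
    for target in sorted(set(previous) | set(latest)):
        before, after = previous.get(target), latest.get(target)
        if before is None or after is None:
            # Targets appear or disappear when the output schema changes.
            report[target] = "new" if before is None else "removed"
        elif after - before > tolerance:
            report[target] = "improved"
        elif before - after > tolerance:
            report[target] = "degraded"
        else:
            report[target] = "unchanged"
    return report


previous = {"total": 0.80, "due_date": 0.60, "vendor": 1.00}
latest = {"total": 0.95, "due_date": 0.40, "vendor": 1.00, "currency": 0.90}

for target, status in compare_runs(previous, latest).items():
    print(f"{target}: {status}")
```

Targets that come back as degraded are the ones to inspect before adding tests; targets that stay stable across several runs are reasonable candidates for explicit assertions.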
Using the SDK
The dashboard flow above maps onto a small set of SDK calls. The same calls
power the MCP tool surface, so anything you can do interactively or through an
agent you can also do programmatically.
Create an experiment
Pick a supported block, give the experiment a name and a document set, and
choose how many consensus passes per document (3, 5, or 7). Documents come from
prior workflow runs (via document_captures) or as explicit handle inputs.
from retab import Retab
client = Retab()
experiment = client.workflows.experiments.create(
workflow_id="wf_abc123",
block_id="extract-invoice",
name="Q1 invoices",
document_captures=[
{"workflow_run_id": "wfrun_1"},
{"workflow_run_id": "wfrun_2", "step_id": "for_each-0"},
],
n_consensus=5,
)
// The Go SDK does not yet model the workflow experiments API. Call the
// /v1/workflows/{workflow_id}/experiments endpoints directly.
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"os"
)
func main() {
body, err := json.Marshal(map[string]any{
"block_id": "extract-invoice",
"name": "Q1 invoices",
"document_captures": []map[string]any{
{"workflow_run_id": "wfrun_1"},
{"workflow_run_id": "wfrun_2", "step_id": "for_each-0"},
},
"n_consensus": 5,
})
if err != nil {
log.Fatal(err)
}
req, err := http.NewRequest(
http.MethodPost,
"https://api.retab.com/v1/workflows/wf_abc123/experiments",
bytes.NewReader(body),
)
if err != nil {
log.Fatal(err)
}
req.Header.Set("Api-Key", os.Getenv("RETAB_API_KEY"))
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
payload, _ := io.ReadAll(resp.Body)
fmt.Println(string(payload))
}
Creating an experiment through the SDK does not trigger a run: the document set
is registered, but no metrics exist until you run the experiment.
Run the experiment
Trigger consensus runs against the current draft block config. This is async —
the SDK returns a job_id and you poll for completion.
run = client.workflows.experiments.runs.create(
workflow_id="wf_abc123",
experiment_id=experiment.id,
)
# Block until the job finishes (or raises on failure / timeout).
client.jobs.wait_for_completion(run.job_id)
// Trigger an experiment run. Polling for the resulting job is left to the caller.
package main
import (
"bytes"
"fmt"
"io"
"log"
"net/http"
"os"
)
func main() {
req, err := http.NewRequest(
http.MethodPost,
"https://api.retab.com/v1/workflows/wf_abc123/experiments/exp_abc123/run",
bytes.NewReader([]byte("{}")),
)
if err != nil {
log.Fatal(err)
}
req.Header.Set("Api-Key", os.Getenv("RETAB_API_KEY"))
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
payload, _ := io.ReadAll(resp.Body)
fmt.Println(string(payload))
}
Pass n_consensus=... to override per-run, or retry_failed_only=True to
re-run only the documents that errored in the previous run.
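As a sketch of how these two options might be wrapped, the helper below forwards them to `runs.create` only when they are set and then waits for the job. It uses only the calls and parameter names shown on this page; the `trigger_run` wrapper itself is hypothetical, and it assumes both options are accepted as keyword arguments to `runs.create`.

```python
def trigger_run(client, workflow_id, experiment_id,
                n_consensus=None, retry_failed_only=False):
    """Start an experiment run with optional per-run overrides, then wait."""
    kwargs = {"workflow_id": workflow_id, "experiment_id": experiment_id}
    if n_consensus is not None:
        # Per-run override of the experiment's consensus count.
        kwargs["n_consensus"] = n_consensus
    if retry_failed_only:
        # Re-run only the documents that errored in the previous run.
        kwargs["retry_failed_only"] = True
    run = client.workflows.experiments.runs.create(**kwargs)
    # Block until the job finishes (or raises on failure / timeout).
    client.jobs.wait_for_completion(run.job_id)
    return run


# Example usage with a real client (requires a configured API key):
#   from retab import Retab
#   trigger_run(Retab(), "wf_abc123", "exp_abc123", retry_failed_only=True)
```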
Read metrics
Metrics live behind one endpoint with four views — start at summary, drill
into by_target on low-scoring fields, then into votes to see voter
disagreement on a specific cell.
summary = client.workflows.experiments.get_metrics(
workflow_id="wf_abc123",
experiment_id=experiment.id,
view="summary",
)
# A weak field surfaces in summary.aggregate.likelihoods → drill in
target_view = client.workflows.experiments.get_metrics(
workflow_id="wf_abc123",
experiment_id=experiment.id,
view="by_target",
target_path="line_items.*.unit_price",
)
# To see what each voter said for one document/target cell
votes = client.workflows.experiments.get_metrics(
workflow_id="wf_abc123",
experiment_id=experiment.id,
view="votes",
document_id="expdoc_xyz",
target_path="line_items.*.unit_price",
)
package main
import (
"fmt"
"io"
"log"
"net/http"
"net/url"
"os"
)
func fetchMetrics(query url.Values) ([]byte, error) {
endpoint := "https://api.retab.com/v1/workflows/wf_abc123/experiments/exp_abc123/metrics?" + query.Encode()
req, err := http.NewRequest(http.MethodGet, endpoint, nil)
if err != nil {
return nil, err
}
req.Header.Set("Api-Key", os.Getenv("RETAB_API_KEY"))
resp, err := http.DefaultClient.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
return io.ReadAll(resp.Body)
}
func main() {
summary, err := fetchMetrics(url.Values{"view": {"summary"}})
if err != nil {
log.Fatal(err)
}
fmt.Println(string(summary))
// A weak field surfaces in summary.aggregate.likelihoods - drill in
target, err := fetchMetrics(url.Values{
"view": {"by_target"},
"target_path": {"line_items.*.unit_price"},
})
if err != nil {
log.Fatal(err)
}
fmt.Println(string(target))
// To see what each voter said for one document/target cell
votes, err := fetchMetrics(url.Values{
"view": {"votes"},
"document_id": {"expdoc_xyz"},
"target_path": {"line_items.*.unit_price"},
})
if err != nil {
log.Fatal(err)
}
fmt.Println(string(votes))
}
If the latest run is stale relative to the current block config or document
set, get_metrics returns a stale_metrics error envelope — call
runs.create(...) to recompute.
Update / delete / duplicate
# Change the document set or n_consensus — invalidates existing metrics
client.workflows.experiments.update(
workflow_id="wf_abc123",
experiment_id=experiment.id,
n_consensus=7,
)
# Snapshot an experiment for a quick A/B against a config change
copy = client.workflows.experiments.duplicate(
workflow_id="wf_abc123",
experiment_id=experiment.id,
)
client.workflows.experiments.delete(
workflow_id="wf_abc123",
experiment_id=experiment.id,
)
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"os"
)
const base = "https://api.retab.com/v1/workflows/wf_abc123/experiments/exp_abc123"
func send(method, url string, body any) ([]byte, error) {
var reader *bytes.Reader
if body != nil {
raw, err := json.Marshal(body)
if err != nil {
return nil, err
}
reader = bytes.NewReader(raw)
} else {
reader = bytes.NewReader(nil)
}
req, err := http.NewRequest(method, url, reader)
if err != nil {
return nil, err
}
req.Header.Set("Api-Key", os.Getenv("RETAB_API_KEY"))
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
return io.ReadAll(resp.Body)
}
func main() {
// Change n_consensus - invalidates existing metrics
updated, err := send(http.MethodPatch, base, map[string]any{"n_consensus": 7})
if err != nil {
log.Fatal(err)
}
fmt.Println(string(updated))
// Snapshot an experiment for a quick A/B against a config change
copy, err := send(http.MethodPost, base+"/duplicate", map[string]any{})
if err != nil {
log.Fatal(err)
}
fmt.Println(string(copy))
// Delete the experiment
if _, err := send(http.MethodDelete, base, nil); err != nil {
log.Fatal(err)
}
}
Run every experiment on a block at once
Useful when you’ve changed a block config and want to refresh every experiment
attached to it in one call.
result = client.workflows.experiments.run_batch(
workflow_id="wf_abc123",
block_id="extract-invoice",
)
print(f"Triggered {result.experiment_count} experiments")
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"os"
)
func main() {
body, err := json.Marshal(map[string]any{"block_id": "extract-invoice"})
if err != nil {
log.Fatal(err)
}
req, err := http.NewRequest(
http.MethodPost,
"https://api.retab.com/v1/workflows/wf_abc123/experiments/run-batch",
bytes.NewReader(body),
)
if err != nil {
log.Fatal(err)
}
req.Header.Set("Api-Key", os.Getenv("RETAB_API_KEY"))
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
payload, _ := io.ReadAll(resp.Body)
fmt.Println(string(payload))
}
For the full method reference (including async variants under
AsyncRetab.workflows.experiments), see the
API reference.