Value Bench

Value bench · absolute, no baseline

Work you can trust unreviewed

We conduct Agentry once per task and judge its real work on four axes: the decisions it made, the code it shipped, whether it ever claims to be done when it is not, and whether its own verifier catches its bugs. Every number is read from the conductor's artifacts, never from a self-report, and sits behind controls that abort the batch before a fooled number ships.

Demonstrated, not scored

What you also get

Beyond the four numbers: the qualitative value the bench shows in the artifact trail of every run.

Drill-down · the readable trace

Per-task census

One row per (task × repeat). Every axis signal, in the open — click a row for the full record. A blank cell is an honest absent signal (a one-shot wrote no decision trail; a degenerate run left no tree), never a silent zero.

Drill-down · spread, not a point

Axis distribution

Each axis as a bar with its mean and the denominator it was measured over — so the headline can never hide the spread or the small N behind it.
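A minimal sketch of how a mean can be reported together with its denominator, assuming each census row stores an axis signal as a number or as `None` when the signal is absent. The row layout and the axis keys `decisions` and `code` are hypothetical, chosen only for illustration; they are not taken from the bench itself.

```python
from statistics import mean
from typing import Optional


def summarize_axis(values: list[Optional[float]]) -> dict:
    """Summarize one axis, counting only rows where the signal is present."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),        # denominator the mean was measured over
        "rows": len(values),      # total (task x repeat) rows, blanks included
        "mean": mean(present) if present else None,  # None, not 0, when nothing was measured
    }


# Hypothetical census rows: None marks an absent signal, 0.0 is a real measured zero.
rows = [
    {"task": "t1", "repeat": 0, "decisions": 0.5, "code": 1.0},
    {"task": "t1", "repeat": 1, "decisions": None, "code": 0.5},
    {"task": "t2", "repeat": 0, "decisions": 1.0, "code": 0.0},
]

summary = {
    axis: summarize_axis([row[axis] for row in rows])
    for axis in ("decisions", "code")
}
```

Keeping `None` distinct from `0.0` is what lets a blank cell stay blank: the absent row shrinks `n` instead of dragging the mean down.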

Drill-down · the meter can't flatter itself

Controls — the gates before any score

Before a single conduct spends, both judges must prove themselves: each must agree with itself on a re-judge (A/A) and separate planted-good from planted-bad work (discrimination). If either gate fires, the batch aborts and no number is shown.
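The two gates can be sketched roughly as follows. This is an illustration under stated assumptions, not the bench's actual implementation: the `Judge` signature, the `GateFailure` exception, the tolerance and margin values, and the toy judge are all hypothetical.

```python
from typing import Callable, Dict, Sequence

# Hypothetical judge signature: maps a work sample to a score in [0, 1].
Judge = Callable[[str], float]


class GateFailure(Exception):
    """Raised to abort the batch before any score is reported."""


def aa_gate(judge: Judge, samples: Sequence[str], tolerance: float = 0.05) -> None:
    """A/A: judging the same sample twice must give (nearly) the same score."""
    for sample in samples:
        first, second = judge(sample), judge(sample)
        if abs(first - second) > tolerance:
            raise GateFailure(f"A/A drift {abs(first - second):.2f} on {sample!r}")


def discrimination_gate(judge: Judge, good: Sequence[str], bad: Sequence[str],
                        margin: float = 0.2) -> None:
    """Discrimination: every planted-good sample must outscore every planted-bad one."""
    worst_good = min(judge(s) for s in good)
    best_bad = max(judge(s) for s in bad)
    if worst_good - best_bad < margin:
        raise GateFailure(f"separation {worst_good - best_bad:.2f} below margin {margin}")


def run_gates(judges: Dict[str, Judge], good: Sequence[str], bad: Sequence[str]) -> None:
    """Run both gates for every judge; any failure propagates and aborts the batch."""
    for judge in judges.values():
        aa_gate(judge, list(good) + list(bad))
        discrimination_gate(judge, good, bad)


def toy_judge(work: str) -> float:
    """Deterministic stand-in judge used only for this example."""
    return 1.0 if "tests pass" in work else 0.0


GOOD = ["feature done, tests pass"]   # planted-good work
BAD = ["feature done, untested"]      # planted-bad work

run_gates({"toy": toy_judge}, GOOD, BAD)  # passes silently; scoring may proceed
```

Raising an exception rather than returning a flag means a caller cannot accidentally ignore a failed gate and go on to publish a number.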

The corrections log is the product

Each entry is a time the meter disagreed with the conductor and the conductor was right. We fixed the meter, not the conductor. This is why the headline is believable: it survived every occasion on which its own instrument was wrong.

Trust · capture-once, analyze-many

Run history

Every captured run, stored under selfeval/runs/<id>. The report is regenerated from the store with zero live API calls.
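A rough sketch of the capture-once, analyze-many idea: the report is built purely from files on disk, with no network access. The per-run file name `record.json`, its `rows` field, and the report format are hypothetical placeholders; only the one-directory-per-run layout under the store comes from the description above.

```python
import json
from pathlib import Path


def load_runs(store: Path) -> dict[str, dict]:
    """Read every captured run from disk; no network access is involved."""
    runs = {}
    for run_dir in sorted(p for p in store.iterdir() if p.is_dir()):
        record = run_dir / "record.json"  # hypothetical file name
        if record.is_file():
            runs[run_dir.name] = json.loads(record.read_text(encoding="utf-8"))
    return runs


def render_report(runs: dict[str, dict]) -> str:
    """Produce a plain-text run history, one line per run."""
    return "\n".join(
        f"{run_id}: {len(record.get('rows', []))} rows"  # 'rows' is a hypothetical field
        for run_id, record in runs.items()
    )


store = Path("selfeval/runs")
if store.is_dir():  # only render when a store actually exists
    print(render_report(load_runs(store)))
```

Because the functions take only a path and return plain data, the same store can be re-analyzed as often as needed without re-running any task.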