Work you can trust unreviewed
We conduct Agentry once per task and judge its real work on four axes — the decisions it made, the code it shipped, that it never lies about being done, and that its own verifier catches its bugs. Every number is read from the conductor's artifacts, never a self-report, behind controls that abort the batch before a fooled number ships.
What you also get
Beyond the four numbers: the qualitative value the bench shows in the artifact trail of every run.
Per-task census
One row per (task × repeat). Every axis signal, in the open — click a row for the full record. A blank cell is an honest absent signal (a one-shot wrote no decision trail; a degenerate run left no tree), never a silent zero.
Axis distribution
Each axis as a bar with its mean and the denominator it was measured over — so the headline can never hide the spread or the small N behind it.
Controls — the gates before any score
Before a single conduct spends, both judges must prove themselves: agree with themselves on a re-judge (A/A), and separate planted-good from planted-bad work (discrimination). If a gate fires, the batch aborts and no number is shown.
Each entry is a time the meter disagreed with the conductor — and the conductor was right. We fixed the meter, not the conductor. This is why the headline is believable: it survived its own instrument being wrong — times.
Run history
Every captured run, stored under selfeval/runs/<id>. The report is regenerated from the store with zero live API.