Self-Eval — the value bench that grades itself

The embedded dashboard measures four things about Agentry's actual work: whether its decisions were sound, whether the code is clean and correct, whether it ever claims "done" on bad work, and whether its own verifier catches its bugs before shipping. It also shows the controls that have to pass before any score is shown, and a corrections log that records every time the meter was wrong. All four numbers are absolute: there is no bare-vs-Agentry head-to-head, and no routing label-match. The figures here are from the latest bench run.

Read first: small N. This is early signal, directional and not statistically powered. A handful of realistic fixtures across a few repeats is enough to see a trend, not enough to claim a rate. Treat every number below as a direction of travel, not a guarantee. Every axis is read from the conductor's real work artifacts, never a self-report.

The four value axes

Figure: The summary scorecard. Four axis cards (Decision quality, Code quality + correctness, Honesty/overclaim, and Verification), each with its number, a plain-English claim, and the spread (mean ± std, n), so the headline is never bare.

Agentry is conducted once per task; the four axis signals are all derived from that single run. Each axis carries its spread (mean ± std and the denominator n) so a point estimate can never hide the variance behind it:

Axis	What it proves	Direction
A · Decision quality	The thinking was sound — right forks surfaced, scoped, defensible. Judged over the decision trail (spec/plan/ADRs).	higher
B · Code quality + correctness	The output is good code and correct — a 4-dimension rubric over the produced tree, plus a held-out oracle for correctness.	higher
C · Honesty / overclaim	It never says "done" on work the judge or oracle calls not-good — trust it unreviewed.	→ 0
D · Verification	The separate verifier catches its own bugs before shipping — escaped-defect rate on bug-prone fixtures, plus did-verify-fire.	→ 0
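The spread shown on each axis card can be sketched as below. This is a minimal illustration, not the bench's own code: the `Spread` type and `spread` function are hypothetical names, and the use of the sample standard deviation (rather than the population one) is an assumption, since the page does not say which is reported.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Spread:
    """Hypothetical container for one axis headline: mean, std, and n."""
    mean: float
    std: float
    n: int

    def __str__(self) -> str:
        # Always print the denominator next to the point estimate.
        return f"{self.mean:.2f} ± {self.std:.2f} (n={self.n})"


def spread(scores: Sequence[float]) -> Spread:
    """Summarise one axis from its per-run scores."""
    if not scores:
        raise ValueError("an axis with no scores has no spread to report")
    # Assumption: sample standard deviation; a single run has no spread,
    # so it is reported as 0.0 with n=1 left visible.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return Spread(statistics.fmean(scores), std, len(scores))
```

For example, `str(spread([0.6, 0.8]))` gives `0.70 ± 0.14 (n=2)`: the same mean from twenty runs would carry a visibly different n, which is the point of never showing the mean alone.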

Anti-gaming: the model's self-score is discarded. Every axis's overall score is recomputed by an independent judge from the rubric, and is read from the work artifacts before any held-out oracle is injected, so the system being graded cannot inflate its own grade or peek at the test.

The per-task census

Figure: The drill-down census. One row per task-and-repeat, with the judged decision and code scores, the oracle pass/fail, whether it self-reported done, whether the verifier fired, and whether a defect escaped. Every task's four signals are in the open; a blank cell is an honest absent signal, never a silent zero.

Below the scorecard, the technical drill-down shows the readable trace: one row per (task × repeat), each carrying its decision score, code score, oracle verdict, self-reported "done", verify-fire, and escaped-defect. A blank cell is an honest absent signal — a one-shot task that wrote no decision trail, or a degenerate run that left no tree — never a silent zero that would drag an axis down unfairly.
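The blank-cell rule can be sketched as a row type whose signals are optional, with aggregates that skip missing values instead of counting them as zero. Everything here is illustrative: `CensusRow`, `overclaimed`, `rate`, and the `good_threshold` cutoff for turning a judged code score into good/not-good are assumptions, not the bench's actual schema or thresholds.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CensusRow:
    """Hypothetical census row: one (task x repeat). None means a blank cell."""
    task: str
    repeat: int
    decision_score: Optional[float] = None
    code_score: Optional[float] = None
    oracle_pass: Optional[bool] = None
    self_reported_done: Optional[bool] = None
    verify_fired: Optional[bool] = None
    defect_escaped: Optional[bool] = None


def overclaimed(row: CensusRow, good_threshold: float) -> Optional[bool]:
    """True when the run said "done" on work the judge or oracle calls not-good.

    Returns None (blank) when there is no self-report or no verdict to compare.
    """
    if row.self_reported_done is None:
        return None
    verdicts = []
    if row.oracle_pass is not None:
        verdicts.append(row.oracle_pass)
    if row.code_score is not None:
        # Assumption: a judged score below the cutoff counts as not-good.
        verdicts.append(row.code_score >= good_threshold)
    if not verdicts:
        return None
    return row.self_reported_done and not all(verdicts)


def rate(flags: Iterable[Optional[bool]]) -> Optional[float]:
    """Share of True among present flags; blanks are excluded, not zeroed."""
    present = [f for f in flags if f is not None]
    if not present:
        return None
    return sum(present) / len(present)
```

With this shape, an Axis C figure would be `rate(overclaimed(r, good_threshold) for r in rows)` and an Axis D escaped-defect figure `rate(r.defect_escaped for r in rows)`. A degenerate run that left no tree contributes `None` to both and shrinks the denominator rather than pulling the rate in either direction.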

What you also get — demonstrated, not scored

Figure: The showcase strip. Four cards (auditable structure, durable memory, real specialists, and you stay in control) showing the qualitative value the bench demonstrates in every run's artifact trail rather than scores, beyond the four numbers.

The four numbers are not the whole story. The bench also demonstrates — in the real artifact trail of every run — the value that doesn't reduce to a single rate:

Auditable structure — every escalated run leaves a spec, a plan, and ADRs you can read; the decision trail Axis A judges is the same one you audit.
Durable memory — decisions, gotchas, and repo-facts persist across runs with provenance, so run two is warmer than run one. Recall is shown, never scored.
Real specialists — architect, implementer, verifier, librarian: bounded experts the conductor dispatches, so the work is built and checked by different agents.
You stay in control — the Workbench surfaces every run live with review gates and steerable tasks; the bench measures work you can already watch happen.

Controls — the gates before any score

Figure: Controls view. Both the code judge and the decision judge are gated on A/A stability and gold ↔ broken discrimination, all marked passed, and proven before any conduct spends or any score is shown.

A score is meaningless if the instrument can't tell good from bad. Before a single conduct spends, both judges must prove themselves — and if either gate fails, the batch aborts and no number is shown (the dashboard renders the abort verdict instead):

A/A stability — the same input judged several times must agree (low judge stdev): the judge isn't noisy.
Gold ↔ broken discrimination — the code judge over each fixture's planted golden vs. broken overlays, and the decision judge over planted gold vs. poor decision artifacts, must separate good from bad by a minimum gap.
Capture-before-inject — every judged input is read from the agent-visible tree before the held-out oracle is injected, so judging is oracle-free by construction.
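The first two gates above can be sketched as simple predicates over judge scores, with the batch aborting if any of them fails. This is a rough illustration under stated assumptions: the function names, the `max_stdev` and `min_gap` thresholds, and the choice to require the gap on every gold/broken pair (rather than on the averages) are all hypothetical. Capture-before-inject is an ordering guarantee rather than a score check and is not modelled here.

```python
import statistics
from typing import Sequence, Tuple


def aa_stable(repeat_scores: Sequence[float], max_stdev: float) -> bool:
    """A/A stability: repeated judgments of the same input must agree."""
    if len(repeat_scores) < 2:
        return False  # one judgment cannot demonstrate agreement
    return statistics.stdev(repeat_scores) <= max_stdev


def discriminates(pairs: Sequence[Tuple[float, float]], min_gap: float) -> bool:
    """Gold vs. broken: each (good, bad) pair must be separated by min_gap.

    Assumption: the gap is required per pair, not only on the averages.
    """
    if not pairs:
        return False  # nothing planted means nothing was proven
    return all(good - bad >= min_gap for good, bad in pairs)


def batch_verdict(gate_results: Sequence[bool]) -> str:
    """Any failed (or missing) gate aborts the batch; no score is shown."""
    return "proceed" if gate_results and all(gate_results) else "abort"
```

A batch would then run only when every gate for both judges holds, for example `batch_verdict([aa_stable(code_repeats, 0.05), discriminates(code_pairs, 0.3)])`, where the two thresholds are placeholder values chosen for illustration.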

The corrections log IS the product

This is the credibility core. Each entry is a time the meter disagreed with the conductor — and the conductor was right. We fixed the meter, not the conductor. The harness survived its own instrument being wrong, and the log is the audit trail that proves it:

Figure: Corrections log, headed 'the corrections log is the product'. A timeline of entries, each showing what the meter said, what the truth was, and the fix applied to the meter: every time the meter was wrong and the conductor was right, the fix went to the meter, not the conductor.

The point isn't that the meter was perfect — it's that when the meter and the conductor disagreed, we audited and fixed the meter. A self-eval you can't catch lying to you isn't a self-eval; this log is how we caught it. And when a number does come out weak, we iterate on Agentry — not on the meter.

See the live dashboard for the full four-axis scorecard, the per-task census, the axis distribution, the control results, and every corrections-log entry.