
Self-Eval — the value bench that grades itself

Agentry ships with a value bench that conducts its own real work once per task and judges it on four absolute axes — no baseline, no routing label-match. The numbers below are early signal, directional not powered, and we say so plainly. The corrections log is the part to trust.

The embedded dashboard measures four things about Agentry's actual work: whether its decisions were sound, whether the code is clean and correct, whether it ever claims done on bad work, and whether its own verifier catches its bugs before shipping. It also shows the controls that must pass before any score is shown, and a corrections log that records every time the meter was wrong. All four numbers are absolute: there is no bare-vs-Agentry head-to-head and no routing label-match. The figures here are from the latest bench run.

Read first: small N. This is early signal: directional, not powered. A handful of realistic fixtures across a few repeats is enough to see a trend, not enough to claim a rate. Treat every number below as a direction of travel, not a guarantee. Every axis is read from the conductor's real work artifacts, never a self-report.

The four value axes

Figure: the summary scorecard. Four axis cards (Decision quality, Code quality + correctness, Honesty/overclaim, and Verification), each with its number, a one-line plain-English claim, and the spread (mean ± std, n), so the headline is never bare.

Agentry is conducted once per task; the four axis signals are all derived from that single run. Each axis carries its spread (mean ± std and the denominator n) so a point estimate can never hide the variance behind it:

| Axis | What it proves | Direction |
| --- | --- | --- |
| A · Decision quality | The thinking was sound: right forks surfaced, scoped, defensible. Judged over the decision trail (spec/plan/ADRs). | higher |
| B · Code quality + correctness | The output is good code and correct: a 4-dimension rubric over the produced tree, plus a held-out oracle for correctness. | higher |
| C · Honesty / overclaim | It never says "done" on work the judge or oracle calls not-good, so you can trust it unreviewed. | → 0 |
| D · Verification | The separate verifier catches its own bugs before shipping: escaped-defect rate on bug-prone fixtures, plus did-verify-fire. | → 0 |

Anti-gaming: the model's self-score is discarded. Every axis overall is recomputed by an independent judge from the rubric, and is read from the work artifacts before any held-out oracle is injected, so the system being graded cannot inflate its own grade or peek at the test.
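
As a rough illustration of how the two "→ 0" axes could be computed from per-run signals, here is a minimal sketch. It is not Agentry's implementation: the `RunSignals` fields, the `NOT_GOOD_BELOW` cutoff, and the choice of denominators are all assumptions made for the example. Note that the conductor's self-score appears nowhere in the inputs; only its "done" claim is compared against the independent judge and oracle.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff: a judged score below this counts as "not-good".
NOT_GOOD_BELOW = 0.6


@dataclass
class RunSignals:
    """Signals read from one run's artifacts (field names are illustrative)."""
    code_score: Optional[float]    # independent judge's score in [0, 1]; None if absent
    oracle_passed: Optional[bool]  # held-out oracle verdict; None if absent
    claimed_done: bool             # did the conductor report "done"?
    bug_prone: bool                # is this a bug-prone fixture?
    defect_escaped: bool           # did a bug get past the verifier?


def is_not_good(run: RunSignals) -> bool:
    """Work is not-good if the oracle fails it or the judge scores it low."""
    oracle_bad = run.oracle_passed is False
    judge_bad = run.code_score is not None and run.code_score < NOT_GOOD_BELOW
    return oracle_bad or judge_bad


def overclaim_rate(runs: list[RunSignals]) -> Optional[float]:
    """Axis C (assumed form): share of runs that claimed done on not-good work."""
    if not runs:
        return None  # no runs means no signal, not a zero rate
    overclaims = sum(1 for r in runs if r.claimed_done and is_not_good(r))
    return overclaims / len(runs)


def escaped_defect_rate(runs: list[RunSignals]) -> Optional[float]:
    """Axis D (assumed form): share of bug-prone runs where a defect escaped."""
    bug_prone = [r for r in runs if r.bug_prone]
    if not bug_prone:
        return None
    return sum(1 for r in bug_prone if r.defect_escaped) / len(bug_prone)
```

Both functions return `None` rather than `0.0` when there is nothing to measure, so an empty batch cannot read as a perfect score.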

The per-task census

Figure: the per-task census drill-down. One row per task-and-repeat, with the judged decision and code scores, the oracle pass/fail, whether it self-reported done, whether the verifier fired, and whether a defect escaped. A blank cell is an honest absent signal, never a silent zero.

Below the scorecard, the technical drill-down shows the readable trace: one row per (task × repeat), each carrying its decision score, code score, oracle verdict, self-reported "done", verify-fire, and escaped-defect. A blank cell is an honest absent signal — a one-shot task that wrote no decision trail, or a degenerate run that left no tree — never a silent zero that would drag an axis down unfairly.
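
The difference between an absent signal and a silent zero is easy to see in a small sketch of the spread calculation. This assumes blanks are represented as `None` and that the reported std is the sample standard deviation; neither detail is confirmed by the source, and the function names are hypothetical.

```python
import statistics
from typing import Optional, Sequence


def spread(values: Sequence[Optional[float]]) -> Optional[tuple[float, float, int]]:
    """Return (mean, std, n) over the present values, or None if all are blank."""
    present = [v for v in values if v is not None]  # blanks are dropped, not zeroed
    n = len(present)
    if n == 0:
        return None
    mean = statistics.mean(present)
    # Sample standard deviation needs two or more points; report 0.0 when n == 1.
    std = statistics.stdev(present) if n > 1 else 0.0
    return mean, std, n


def format_spread(values: Sequence[Optional[float]]) -> str:
    """Render a spread the way the scorecard describes it: mean ± std (n)."""
    result = spread(values)
    if result is None:
        return "no signal"
    mean, std, n = result
    return f"{mean:.2f} ± {std:.2f} (n={n})"
```

With illustrative scores `[0.8, None, 0.6, 0.7]`, the blank row is excluded and the mean is taken over three values (0.70). Treating the blank as zero would instead average four values and pull the mean down to about 0.53, which is the unfair drag the census avoids.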

What you also get — demonstrated, not scored

Figure: the showcase strip. Four cards (auditable structure, durable memory, real specialists, and you stay in control) showing the qualitative value the bench demonstrates in every run's artifact trail rather than scores.

The four numbers are not the whole story. The bench also demonstrates — in the real artifact trail of every run — the value that doesn't reduce to a single rate:

  • Auditable structure — every escalated run leaves a spec, a plan, and ADRs you can read; the decision trail Axis A judges is the same one you audit.
  • Durable memory — decisions, gotchas, and repo-facts persist across runs with provenance, so run two is warmer than run one. Recall is shown, never scored.
  • Real specialists — architect, implementer, verifier, librarian: bounded experts the conductor dispatches, so the work is built and checked by different agents.
  • You stay in control — the Workbench surfaces every run live with review gates and steerable tasks; the bench measures work you can already watch happen.

Controls — the gates before any score

Figure: the Controls view. Both the code judge and the decision judge are gated on A/A stability and gold↔broken discrimination; all gates are marked passed, and they run before any conduct spends or any score is shown.

A score is meaningless if the instrument can't tell good from bad. Before a single conduct spends, both judges must prove themselves — and if either gate fails, the batch aborts and no number is shown (the dashboard renders the abort verdict instead):

  • A/A stability — the same input judged several times must agree (low judge stdev): the judge isn't noisy.
  • Gold ↔ broken discrimination — the code judge over each fixture's planted golden vs. broken overlays, and the decision judge over planted gold vs. poor decision artifacts, must separate good from bad by a minimum gap.
  • Capture-before-inject — every judged input is read from the agent-visible tree before the held-out oracle is injected, so judging is oracle-free by construction.
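
The first two gates above can be sketched as simple checks over judge scores. This is an illustration under stated assumptions, not the bench's actual code: the threshold values, the use of population standard deviation, the comparison of mean scores, and the dictionary layout are all hypothetical.

```python
import statistics
from typing import Mapping, Sequence

# Hypothetical thresholds; the source does not state the real values.
MAX_JUDGE_STDEV = 0.05
MIN_GOLD_BROKEN_GAP = 0.30


def aa_stable(repeat_scores: Sequence[float],
              max_stdev: float = MAX_JUDGE_STDEV) -> bool:
    """A/A stability: repeated judgments of the same input must agree closely."""
    if len(repeat_scores) < 2:
        return False  # stability cannot be shown from a single judgment
    return statistics.pstdev(repeat_scores) <= max_stdev


def discriminates(gold_scores: Sequence[float],
                  broken_scores: Sequence[float],
                  min_gap: float = MIN_GOLD_BROKEN_GAP) -> bool:
    """Gold vs. broken: mean gold score must exceed mean broken score by min_gap."""
    if not gold_scores or not broken_scores:
        return False
    gap = statistics.mean(gold_scores) - statistics.mean(broken_scores)
    return gap >= min_gap


def controls_pass(judges: Mapping[str, Mapping[str, Sequence[float]]]) -> bool:
    """Every judge must pass both gates; otherwise the batch should abort."""
    return all(
        aa_stable(j["aa_repeats"]) and discriminates(j["gold"], j["broken"])
        for j in judges.values()
    )
```

A caller would run `controls_pass` over both the code judge and the decision judge before starting any conduct, and render the abort verdict instead of a score if it returns `False`.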

The corrections log IS the product

This is the credibility core. Each entry is a time the meter disagreed with the conductor — and the conductor was right. We fixed the meter, not the conductor. The harness survived its own instrument being wrong, and the log is the audit trail that proves it:

Figure: the Corrections log, headed 'the corrections log is the product'. A timeline of entries, each showing what the meter said, what the truth was, and the fix applied to the meter. Every entry is a time the meter was wrong and the conductor was right.

The point isn't that the meter was perfect — it's that when the meter and the conductor disagreed, we audited and fixed the meter. A self-eval you can't catch lying to you isn't a self-eval; this log is how we caught it. And when a number does come out weak, we iterate on Agentry — not on the meter.
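
For readers who want to keep a similar log, one possible shape for an entry is sketched below. The three text fields mirror what each dashboard entry shows (what the meter said, what the truth was, and the fix applied to the meter); the class name, field names, and the date field are assumptions for illustration, not Agentry's schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Correction:
    """One corrections-log entry (hypothetical field names)."""
    logged_on: date   # when the disagreement was audited
    meter_said: str   # what the meter reported
    truth: str        # what the audit found
    meter_fix: str    # the change applied to the meter, not the conductor


def render(entry: Correction) -> str:
    """Format an entry as a single timeline line."""
    return (
        f"{entry.logged_on.isoformat()} | meter said: {entry.meter_said} | "
        f"truth: {entry.truth} | fix: {entry.meter_fix}"
    )
```

Making the entry frozen reflects the audit-trail role of the log: once recorded, an entry is not edited in place.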

See the live dashboard for the full four-axis scorecard, the per-task census, the axis distribution, the control results, and every corrections-log entry.