Adaptive agentic layer for Claude Code

The agentic layer that
measures itself.

Right-sized orchestration, SDLC specialists, and durable memory — for Claude Code. It routes each task to the least process that wins, then holds itself to a bar it can show you. Measured, not asserted.

6 Corrections logged Times the meter was wrong and the conductor was right — the receipts.
All passed Eval controls A/A stability + planted gold-vs-poor discrimination on the run store.
MIT · v0.1 License · maturity Pre-1.0, live and measured — honest about where it is.

The thesis

Dropping a powerful model on a hard task and hoping is not engineering.

A capable model will produce something. Whether it produced the right thing, at the right altitude, with the bar held — that is a separate question, and most setups never ask it. Agentry treats orchestration as engineering: the work is shaped to the least process that wins, built by specialists inside clean boundaries, and checked against an acceptance bar before it counts as done. And the system measures itself — so the bar isn't a claim, it's a number you can read.

How it holds together

Three parts, one discipline.

Orchestration, memory, and specialists — each doing one thing well, wired into a system that adapts to the work in front of it.

  • The brain — a right-sized conductor

    Every task is routed to the least process that wins: a one-line fix goes straight to an implementer; an under-specified goal gets a Spec and a gate first; a multi-file feature is planned, split into bounded contracts, and verified. No ceremony where none is earned — and no hoping where a gate is.

  • The memory — the moat that compounds

    Decisions, gotchas, and repo-facts are written to durable memory with provenance, then recalled — few, ranked, scoped — exactly when the next task needs them. The second run is warmer than the first: a problem solved spec-first once can route one-shot later, because the decision is remembered.

  • The harness — capability-first specialists

    Architect, implementer, verifier, designer, librarian — each a specialist with a single concern, dispatched into a clean boundary and held to its contract. They use whatever tools and skills you already have rather than a fixed allowlist, so the harness adapts to your environment instead of fighting it.

How it works

The router, on real tasks.

Pick a task — watch it route to the least process that wins.

/agentry:go "The date formatter drops the timezone — fix it."

routes to one-shot

Single symbol, reversible, no design fork — straight to the implementer in debugging mode.

trace
  1. Classify: one-symbol bug fix, no contract seam.
  2. No ambiguity, no decision to gate — skip spec & plan.
  3. Dispatch implementer: reproduce → fix → regression test.
  4. Verify against the repro; done.

The moat

Memory compounds.

Run the same task twice. The first time, Agentry has no prior decision, so it works spec-first — and writes down what it decided. The next time, that decision is recalled, and the same task routes one-shot. The system gets cheaper at the work it has already reasoned through.

First run · cold

"dedupe the import resolver"

spec-first

No prior decision. Shape the approach, gate it, build — then write the decision to memory with its provenance.

Next run · warm

"dedupe the import resolver"

one-shot

The recalled decision removes the unknown. The least process that wins is now a single bounded pass — no re-deriving what was already settled.

Proof

Show the receipts.

The thesis is "measured, not asserted" — so here is the meter being wrong, in the open. Every entry below is real tracked data, the same run store the eval dashboard reads.

See the evidence

The corrections log

6 corrections logged, and acted on — times the meter said one thing and the conductor was right. 3 would have shipped a wrong number; the rest were refinements.

  1. high

    Dispatch-pattern proxy was invalid

    meter said
    scored one-shot
    truth was
    conductor escalated via the AskUserQuestion gate (spec-first)
    fixed by
    replaced the proxy with artifact-based extraction — read plan/spec/tree, not the tool pattern
  2. high

    events.jsonl seam never existed

    meter said
    assumed a stream file
    truth was
    the benchmark used --output-format json; no such file
    fixed by
    switched shape extraction to the conductor's work-folder artifacts
  3. high

    Cap-timing deflated decompose → one-shot

    meter said
    scored one-shot
    truth was
    a slow planner was killed before it wrote plan.md
    fixed by
    artifact-aware early-terminate — poll the work folder, kill on the real signal
  4. refinement

    'Under-routes' were good decisions

    meter said
    flagged as failures
    truth was
    the lighter shape shipped 98%-quality specs
    fixed by
    quality-gate over under-routes — a shape-mismatch only counts if output quality suffered
  5. refinement

    Pagination mislabeled decompose

    meter said
    labeled decompose
    truth was
    spec-first was right — two independent labelers agreed
    fixed by
    re-labeled the task; the conductor had routed it correctly
  6. refinement

    Grace window deflates some decompose

    meter said
    45s grace
    truth was
    slow decompose→spec-first planners clipped at the boundary
    fixed by
    bump grace to 120s for the definitive run (queued)

The controls held.

The harness runs the checks that keep the meter honest: an A/A null test (the same task scored against itself should show no difference), and a planted-discrimination test (deliberately strong and deliberately weak runs should separate). In the tracked runs they came back as expected — the null test stayed null, and the planted cases separated. We report this qualitatively: it's a pre-1.0 harness on a small sample, so read it as "the meter behaves," not a headline metric.

We don't ask you to trust the routing — we show you where it was wrong and how we fixed the meter, not the verdict. Measured, not asserted. The receipts are public, the harness is reproducible, and the dashboard is live.

Agentry is v0.1, pre-1.0, MIT. The numbers are small and caveated on purpose.

Install

One copy-paste from installed.

Paste this into Claude Code. It adds Agentry from its GitHub repo — no account, no build step.

/plugin marketplace add Codestz/agentry
  • Requires Node ≥ 24 (the repo standard) and Claude Code.
  • Then run /agentry:go <task> — and Agentry routes it for you.