Adaptive agentic layer for Claude Code

The agentic layer that
measures itself.

Right-sized orchestration, SDLC specialists, and durable memory — for Claude Code. It routes each task to the least process that wins, then holds itself to a bar it can show you. Measured, not asserted.

Install Agentry See the evidence

6 Corrections logged Times the meter was wrong and the conductor was right — the receipts.

All passed Eval controls A/A stability + planted gold-vs-poor discrimination on the run store.

MIT · v0.1 License · maturity Pre-1.0, live and measured — honest about where it is.

.hero__out { opacity: 1 !important; transform: none !important; } .hero__cmd::after { content: '/agentry:go "Add pagination across the API, service and data layers"'; }

› classify: multi-file feature, contract seams across 3 layers › route: decompose + verify — least process that wins › architect → plan + bounded task contracts › implement in parallel → verify the assembled whole

The thesis

Dropping a powerful model on a hard task and hoping is not engineering.

A capable model will produce something. Whether it produced the right thing, at the right altitude, with the bar held — that is a separate question, and most setups never ask it. Agentry treats orchestration as engineering: the work is shaped to the least process that wins, built by specialists inside clean boundaries, and checked against an acceptance bar before it counts as done. And the system measures itself — so the bar isn't a claim, it's a number you can read.

How it holds together

Three parts, one discipline.

Orchestration, memory, and specialists — each doing one thing well, wired into a system that adapts to the work in front of it.

The brain — a right-sized conductor

Every task is routed to the least process that wins: a one-line fix goes straight to an implementer; an under-specified goal gets a Spec and a gate first; a multi-file feature is planned, split into bounded contracts, and verified. No ceremony where none is earned — and no hoping where a gate is.
The memory — the moat that compounds

Decisions, gotchas, and repo-facts are written to durable memory with provenance, then recalled — few, ranked, scoped — exactly when the next task needs them. The second run is warmer than the first: a problem solved spec-first once can route one-shot later, because the decision is remembered.
The harness — capability-first specialists

Architect, implementer, verifier, designer, librarian — each a specialist with a single concern, dispatched into a clean boundary and held to its contract. They use whatever tools and skills you already have rather than a fixed allowlist, so the harness adapts to your environment instead of fighting it.

How it works

The router, on real tasks.

Pick a task — watch it route to the least process that wins.

/agentry:go "The date formatter drops the timezone — fix it."

routes to one-shot

Single symbol, reversible, no design fork — straight to the implementer in debugging mode.

trace

Classify: one-symbol bug fix, no contract seam.
No ambiguity, no decision to gate — skip spec & plan.
Dispatch implementer: reproduce → fix → regression test.
Verify against the repro; done.

The moat

Memory compounds.

Run the same task twice. The first time, Agentry has no prior decision, so it works spec-first — and writes down what it decided. The next time, that decision is recalled, and the same task routes one-shot. The system gets cheaper at the work it has already reasoned through.

First run · cold

"dedupe the import resolver"

spec-first

No prior decision. Shape the approach, gate it, build — then write the decision to memory with its provenance.

Next run · warm

"dedupe the import resolver"

one-shot

The recalled decision removes the unknown. The least process that wins is now a single bounded pass — no re-deriving what was already settled.

Proof

Show the receipts.

The thesis is "measured, not asserted" — so here is the meter being wrong, in the open. Every entry below is real tracked data, the same run store the eval dashboard reads.

See the evidence

The corrections log

6 corrections logged, and acted on — times the meter said one thing and the conductor was right. 3 would have shipped a wrong number; the rest were refinements.

high Jun 16, 2026

Dispatch-pattern proxy was invalid

meter said

scored one-shot

truth was

conductor escalated via the AskUserQuestion gate (spec-first)

fixed by

replaced the proxy with artifact-based extraction — read plan/spec/tree, not the tool pattern
high Jun 16, 2026

events.jsonl seam never existed

meter said

assumed a stream file

truth was

the benchmark used --output-format json; no such file

fixed by

switched shape extraction to the conductor's work-folder artifacts
high Jun 15, 2026

Cap-timing deflated decompose → one-shot

meter said

scored one-shot

truth was

a slow planner was killed before it wrote plan.md

fixed by

artifact-aware early-terminate — poll the work folder, kill on the real signal
refinement Jun 16, 2026

'Under-routes' were good decisions

meter said

flagged as failures

truth was

the lighter shape shipped 98%-quality specs

fixed by

quality-gate over under-routes — a shape-mismatch only counts if output quality suffered
refinement Jun 15, 2026

Pagination mislabeled decompose

meter said

labeled decompose

truth was

spec-first was right — two independent labelers agreed

fixed by

re-labeled the task; the conductor had routed it correctly
refinement Jun 17, 2026

Grace window deflates some decompose

meter said

45s grace

truth was

slow decompose→spec-first planners clipped at the boundary

fixed by

bump grace to 120s for the definitive run (queued)

The controls held.

The harness runs the checks that keep the meter honest: an A/A null test (the same task scored against itself should show no difference), and a planted-discrimination test (deliberately strong and deliberately weak runs should separate). In the tracked runs they came back as expected — the null test stayed null, and the planted cases separated. We report this qualitatively: it's a pre-1.0 harness on a small sample, so read it as "the meter behaves," not a headline metric.

We don't ask you to trust the routing — we show you where it was wrong and how we fixed the meter, not the verdict. Measured, not asserted. The receipts are public, the harness is reproducible, and the dashboard is live.

Agentry is v0.1, pre-1.0, MIT. The numbers are small and caveated on purpose.

Install

One copy-paste from installed.

Paste this into Claude Code. It adds Agentry from its GitHub repo — no account, no build step.

$/plugin marketplace add Codestz/agentry

Requires Node ≥ 24 (the repo standard) and Claude Code.
Then run /agentry:go <task> — and Agentry routes it for you.

The agentic layer that measures itself.