We used ATO to test ATO — Part 6. The methodology runner ships, and we ran it against the n=150 corpus from Part 5.
- Part 1 — v2.9 PR-1: grounded-mode foundation, the regression we found.
- Parts 2–4 — v2.9 PR-2/3/4: close the false negative, route API providers, refuse-with-options.
- Part 5 — n=150 scaled re-run; one Part 4 claim falsified; methodology runner spec.
- Part 6 (you are here) — v2.10 PR-2/3/3.1/4 shipped; runner real-data validated on 170 receipts.
Part 5 ended with: “The methodology runner the v2.10 PR-1 implements will run this loop for customers.” This morning we shipped PR-2 through PR-4 of v2.10 — the CLI surface, the fan-out runner, the rubric library, and the missing piece that made the whole thing real: adopt, which lets the runner compose methodologies over existing execution_logs rows without re-dispatching. Then we ran the runner against the 150 receipts from Part 5 plus a fresh 13-receipt sub-eval for the LLM-judge path. Total additional spend to do this entire Part 6 validation: $0.07. The blog post you’re reading is what the receipts say.
What shipped
Four PRs went out in the last 36 hours. The methodology runner is the v2.10 release’s headline Pro feature.
| PR | What | LOC |
|---|---|---|
| v2.10 PR-1 | Schema (methodologies, methodology_runs, methodology_run_dispatches with dual cost accounting columns) + Rust types + cost estimator + open-source pricing.json rate card | ~750 |
| v2.10 PR-2 | ato evaluations methodology create / list / get / archetypes / cost-estimate: the CRUD surface + the pre-run dual-cost estimate | ~660 |
| v2.10 PR-3 | Fan-out runner + composer: variant matrix expansion → sequential ato dispatch shell-outs → per-cell mean/sd/95% CI + pairwise Welch t. runs list/show subcommands | ~1340 |
| v2.10 PR-3.1 | adopt: compose methodologies over EXISTING execution_logs without re-dispatching. Variant cell derived from the row. Made Part 6 possible at $0 incremental LLM spend. Plus runtime override on VariantMatrix for CLI-vs-API head-to-heads | ~340 |
| v2.10 PR-4 | Rubric library: regex / structural / llm_judge / composite + the score subcommand. LLM-judge cost lands in provider_judge_cost_usd. Brace-balanced JSON extraction tolerates judges that wrap their score in prose | ~1000 |
All four PRs live on main. The pre-commit gate (cargo check + tsc + 133 vitests + cargo unit tests + 18 CLI smoke commands) caught nothing new on either commit.
Validation step 1: adopt the n=150 corpus
Part 5’s 150 dispatches sit in ~/.ato/local.db’s execution_logs table. The runner doesn’t need to re-fire them — that’s exactly what adopt exists for. We created a methodology and adopted every claude-sonnet-4-6 success row from 2026-05-24:
ato evaluations methodology create --config part5.json --human
# Created methodology 'part5-real-150-eval' (294e0499-…). Variant matrix:
# 1 prompts × 1 models × 1 conditions × 30 reps = 30 dispatches per run.
ato evaluations methodology adopt part5-real-150-eval \
--since 2026-05-24 \
--runtime claude \
--model claude-sonnet-4-6 \
--status success \
--limit 200 \
--human
# Adopted 157 execution_logs rows into methodology run da25a236-…
# methodology: part5-real-150-eval
# distinct prompts: 21
# YOUR cost: $6.1985
# YOUR tokens: 9837 in / 411265 out
# OUR cost: $0.0496
# margin (est): $0.2404
The dual cost accounting is the load-bearing number to read here. The customer’s LLM bill for the underlying 157 dispatches was $6.20. Our cost to deliver the methodology run on top of it (storage, bandwidth, orchestrator compute) was $0.05, so the customer’s LLM spend is roughly 124× our delivery cost. Against the $0.29 per-run allocation a $29 Pro tier covers, that leaves the $0.24 estimated margin shown in the receipt. Pricing transparency lives in packages/ato-pricing/pricing.json: the same file the cost estimator reads, the same numbers in the receipt.
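A minimal sketch of the arithmetic behind that receipt. It assumes the margin line is simply the per-run allocation minus OUR cost, and that the 124× figure is the ratio of the two rounded costs; both are inferences from the printed numbers, not confirmed against the runner's source. The variable names are illustrative.

```python
# Figures copied from the adopt receipt above.
customer_llm_cost = 6.1985   # "YOUR cost"
our_delivery_cost = 0.0496   # "OUR cost"
per_run_allocation = 0.29    # per-run allocation quoted in the text

# Assumption: estimated margin = allocation minus our delivery cost.
margin = per_run_allocation - our_delivery_cost

# Assumption: the 124x figure compares the two costs after rounding to cents.
spend_ratio = round(customer_llm_cost, 2) / round(our_delivery_cost, 2)

print(f"margin (est): ${margin:.4f}")
print(f"customer spend vs our cost: {spend_ratio:.0f}x")
```

With the unrounded figures the ratio comes out closer to 125×, which is why the sketch rounds first.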
Validation step 2: regex rubric over the adopted run
Set a security-keyword regex on the methodology — matches if the response mentions vulnerability / injection / race / leak / secret / exposure / hardcoded / CVE. Then score every dispatch in the adopted run:
ato evaluations methodology score da25a236-… --human
# Scored 157 dispatches in run da25a236-…
# mean score: 0.465
# passed (≥0.5): 73/157
# judge cost: $0.0000
73 of 157 dispatches mention at least one security keyword. The runs show command then breaks that down by variant cell. The cells where n is high enough to matter (industry-baseline-mid, n=10 per cell):
| prompt | condition | n | mean score | passed ≥0.5 | cost mean |
|---|---|---|---|---|---|
| [15] | advisory | 10 | 0.700 | 7/10 | $0.0440 |
| [15] | default | 11 | 0.636 | 7/11 | $0.0060 |
| [16] | advisory | 10 | 0.300 | 3/10 | $0.0780 |
| [16] | default | 10 | 0.600 | 6/10 | $0.0042 |
| [17] | advisory | 10 | 1.000 | 10/10 | $0.1007 |
| [17] | default | 10 | 0.900 | 9/10 | $0.0043 |
| [18] | advisory | 10 | 0.000 | 0/10 | $0.0550 |
Two things to read out of this.
First, advisory mode isn’t always better. Prompt [16] under advisory grounding scored 0.300 (3/10 mentioning security keywords); the same prompt without grounding scored 0.600 (6/10). The more expensive tooled-up dispatch was rubric-worse on this prompt. That’s a real product finding, surfaced by the runner on receipts the customer already has.
Second, prompt [18] under advisory grounded mode failed every dispatch (0/10). The model burned $0.055 per call, took 13.6 seconds on average, and not one of those dispatches mentioned a security keyword the rubric could find. If we’d shipped a scheduled regression-watch on this prompt, this is exactly the kind of alert the customer would want.
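For readers who want to see what a keyword rubric of this shape does, here is a small sketch. The pattern is a hypothetical reconstruction from the keywords listed earlier (the methodology's real pattern is not shown in this post), and the function names and sample responses are made up for illustration.

```python
import re
from statistics import mean

# Hypothetical pattern built from the keyword list in the text.
SECURITY_PATTERN = re.compile(
    r"vulnerab|injection|race|leak|secret|exposure|hardcoded|CVE",
    re.IGNORECASE,
)

def regex_score(response: str) -> float:
    """Binary rubric: 1.0 if any keyword appears, else 0.0."""
    return 1.0 if SECURITY_PATTERN.search(response) else 0.0

def summarize(responses, threshold=0.5):
    """Return (mean score, number passed, total) for a list of responses."""
    scores = [regex_score(r) for r in responses]
    passed = sum(s >= threshold for s in scores)
    return mean(scores), passed, len(scores)

# Illustrative responses, not taken from the corpus.
responses = [
    "Found a SQL injection in the login handler.",
    "The file exports three helper functions.",
    "Possible race condition when two sessions refresh at once.",
    "No issues found.",
]
mean_score, passed, total = summarize(responses)
print(f"mean score: {mean_score:.3f}  passed (>=0.5): {passed}/{total}")
```

Because each score is 0 or 1, the mean equals the pass rate, which matches the receipt: 73/157 ≈ 0.465. A bare substring pattern like this would also match "race" inside "trace", so a real rubric would probably want word boundaries.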
Validation step 3: LLM-judge rubric, 13 dispatches, $0.07
Regex catches keywords. LLM-judge evaluates whether the response is actually good. We built a second methodology specifically to test the judge path on the 13 gemini-2.5-flash receipts in the same corpus:
{
"slug": "part5-llm-judge",
"archetype": "custom",
"variant_matrix": {
"prompts": ["(adopted)"],
"models": ["gemini-2.5-flash"],
"conditions": ["default"],
"reps_per_cell": 13
},
"rubric": {
"kind": "llm_judge",
"judge_model": "claude-sonnet-4-6",
"judge_runtime": "claude",
"prompt_template": "Rate this AI agent’s response on a 0..1 scale. The agent was asked to perform a security or code review task. Reply ONLY with a JSON object: {\"score\": <float>, \"reason\": \"...\"}.\n\nOriginal prompt: {{prompt}}\n\nResponse to evaluate: {{response}}"
}
}
Adopt the 13 gemini rows. Then score — this is the path that costs money, because every dispatch fans out a judge call:
ato evaluations methodology score 34616b2f-… --human
# Scored 13 dispatches in run 34616b2f-…
# mean score: 0.669
# passed (≥0.5): 10/13
# judge cost: $0.0712
$0.07 for 13 LLM-judge evaluations — roughly half a cent per judged dispatch through the claude CLI subscription path. The scores are differentiated rather than binary:
- prompt [8] default: 0.850 — the judge thought this was a strong answer
- prompt [6] default: 0.820
- prompt [9] default: 0.300
- prompt [10] advisory: 0.150
- prompt [10] default: 0.050 — the judge thought this was nearly useless
This is the thing customers actually want. Not “your regex matched.” Not “your dispatch completed in 8 seconds.” A second LLM read the response and said: that one was good, that one was bad, here’s why. The 0..1 scale composes cleanly into the runner’s existing mean / SD / 95% CI math.
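PR-4's notes mention brace-balanced JSON extraction for judges that wrap their score in prose. The sketch below shows one way that idea can work; it is not the runner's Rust implementation, and the function name and sample reply are hypothetical.

```python
import json

def extract_first_json_object(text: str):
    """Return the first brace-balanced span that parses as a JSON object, else None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                # Braces inside string literals must not affect the depth count.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not JSON; try the next "{"
        start = text.find("{", start + 1)
    return None

# Hypothetical judge reply that wraps the JSON in prose.
reply = 'Sure! My rating: {"score": 0.85, "reason": "flags the {id} interpolation"} Hope that helps.'
verdict = extract_first_json_object(reply)
print(verdict["score"], verdict["reason"])
```

Tracking string state matters here: the `{id}` inside the reason text would otherwise throw off the brace count.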
How the runner makes a real product decision
The three demos above were all on existing receipts, $0 incremental customer cost. What the runner is built to do (and what the Pro tier sells) is the active version of this loop. Three first-class archetypes ship in v2.10:
1. “Should Agent X stay on Opus 4.7, or can Sonnet 4.6 handle this?”
The model-ladder archetype. N models, M prompts that represent the agent’s real work, R repetitions per cell. Score every dispatch through a rubric. The output is a cost-quality Pareto frontier: at what quality threshold can you drop from Opus to Sonnet on this specific agent? Pre-run cost estimate from PR-2 tells you what the full sweep will cost before you commit.
ato evaluations methodology create --config model-ladder.json
ato evaluations methodology cost-estimate model-ladder --judge-calls 1 --human
# Variant matrix: 8 prompts × 4 models × 1 condition × 30 reps = 960 dispatches
# YOUR estimated LLM spend (billing=byok):
# claude-opus-4-7 240 dispatches $76.86
# claude-sonnet-4-6 240 dispatches $15.37
# gemini-2.5-pro 240 dispatches $5.13
# gpt-5 240 dispatches $36.00
# YOUR total: $133.36
# OUR cost to deliver: $1.21
# Tier fit: ExceedsTier
The customer sees the $133 estimate before committing. They can downscope to 3 models, or to n=10 instead of n=30. The rate card is open-source, so they can audit the math.
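The dispatch count in that estimate is a plain Cartesian product. A short sketch of the expansion, with placeholder prompt names and the per-model subtotals copied from the estimate output above (the function name is an assumption, not the runner's API):

```python
from collections import Counter
from itertools import product

def expand_matrix(prompts, models, conditions, reps_per_cell):
    """One (prompt, model, condition, rep) tuple per dispatch."""
    return [
        (p, m, c, rep)
        for p, m, c in product(prompts, models, conditions)
        for rep in range(reps_per_cell)
    ]

prompts = [f"prompt-{i}" for i in range(1, 9)]  # placeholder names for the 8 prompts
models = ["claude-opus-4-7", "claude-sonnet-4-6", "gemini-2.5-pro", "gpt-5"]
dispatches = expand_matrix(prompts, models, ["default"], 30)
per_model = Counter(m for _, m, _, _ in dispatches)

# Per-model subtotals as printed by cost-estimate above.
subtotals_usd = {
    "claude-opus-4-7": 76.86,
    "claude-sonnet-4-6": 15.37,
    "gemini-2.5-pro": 5.13,
    "gpt-5": 36.00,
}
total_usd = sum(subtotals_usd.values())

print(len(dispatches), dict(per_model))
print(f"YOUR total: ${total_usd:.2f}")
```

Dropping to 3 models or to 10 reps shrinks the product directly, which is why those are the two downscoping levers.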
2. “Does grounded mode actually change behavior on MY prompts?”
The tools-vs-no-tools archetype. Same prompts, same model, with vs without grounding. This is exactly what Part 5 tested by hand — the runner now does it automatically. From the n=150 data: prompt [16] is rubric-worse under grounding; prompt [17] is rubric-better. Both findings would have shipped in Part 5 if the runner had existed; the methodology is the runner’s primary archetype.
3. “Did our agent get worse this week?”
The regression-watch archetype. Re-run any methodology on a schedule. Diff this week’s scores against last week’s. The runner’s composer ships pairwise Welch t-statistics + a CI-disjoint heuristic for “is this difference real,” so a 0.85 → 0.62 drop with overlapping CIs doesn’t fire an alert but a 0.85 → 0.30 disjoint drop does.
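To make the composer math concrete, here is a sketch of per-cell statistics, Welch's t with Welch-Satterthwaite degrees of freedom, and a CI-disjoint check. It assumes a normal-approximation 95% interval (z = 1.96) and at least two scores per cell; the runner's exact interval method is not stated in this post, and the function names are illustrative. The example reconstructs binary scores from the prompt [16] rows of the regex table (3/10 vs 6/10).

```python
import math
from statistics import mean, stdev

def cell_stats(scores, z=1.96):
    """Mean, sample SD, and a normal-approximation 95% CI for one variant cell."""
    n = len(scores)
    m = mean(scores)
    sd = stdev(scores)  # sample SD; needs n >= 2
    half_width = z * sd / math.sqrt(n)
    return {"n": n, "mean": m, "sd": sd, "ci": (m - half_width, m + half_width)}

def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va = a["sd"] ** 2 / a["n"]
    vb = b["sd"] ** 2 / b["n"]
    if va + vb == 0:
        raise ValueError("both cells have zero variance; t is undefined")
    t = (a["mean"] - b["mean"]) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a["n"] - 1) + vb ** 2 / (b["n"] - 1))
    return t, df

def ci_disjoint(a, b):
    """True when the two confidence intervals do not overlap."""
    return a["ci"][1] < b["ci"][0] or b["ci"][1] < a["ci"][0]

advisory = cell_stats([1.0] * 3 + [0.0] * 7)  # prompt [16] advisory, 3/10
default = cell_stats([1.0] * 6 + [0.0] * 4)   # prompt [16] default, 6/10
t, df = welch_t(advisory, default)
print(f"t = {t:.3f}, df = {df:.1f}, CIs disjoint: {ci_disjoint(advisory, default)}")
# t = -1.342, df = 17.9, CIs disjoint: False
```

Under this sketch's assumptions the two prompt [16] intervals overlap at n=10, so a CI-disjoint alert would stay quiet on that pair; a larger n narrows the intervals.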
Update 2026-05-25 PM — cross-model data at industry-baseline n=30
The original Part 6 above was claude-only. Reason: a dev-binary keychain ACL on macOS prevented the methodology runner from decrypting stored API provider keys, so we ran everything through claude CLI runtime (which uses Claude Code’s own session, not ATO’s keychain). I wrote that off as “documented limitation” in earlier drafts. Will read this and corrected it: the prod app bundle’s keychain works fine; the runner just needed to be told to delegate to the prod binary.
v2.11 PR-12.6 ships the fix — an ATO_CLI_PATH env override that points the methodology runner’s shell-out at any installed ato binary (mirroring the Tauri cron module’s pattern that already worked). One line of env, and a dev-build ATO reaches every API provider through the prod-app-bundle’s keychain. Same fix unblocks the diagnose dispatch + the LLM-judge rubric.
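The override pattern is easy to picture. Below is a hedged sketch of an env-first resolution order, not the actual cli_path module: it assumes the override is honored only when it points at an executable file and that the sole fallback is a PATH lookup, whereas the real fallback chain may have more steps.

```python
import os
import shutil

def resolve_ato_binary(env=None):
    """Pick the ato binary to shell out to: ATO_CLI_PATH first, then PATH."""
    env = os.environ if env is None else env
    override = env.get("ATO_CLI_PATH")
    # Assumption: ignore an override that is not an executable file.
    if override and os.path.isfile(override) and os.access(override, os.X_OK):
        return override
    # Fallback: ordinary PATH lookup; returns None when nothing is found.
    return shutil.which("ato", path=env.get("PATH"))

print(resolve_ato_binary())
```

Passing the environment in as a parameter keeps the function easy to test without mutating the process environment.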
Then we re-fired the Part 5 prompts through gemini-2.5-flash at industry-baseline n=30 per prompt (150 dispatches, $0.56 customer / $0.11 ours, 84 minutes). Apples-to-apples vs. claude-sonnet-4-6 n=30:
| Prompt | claude n=30 score | gemini-flash n=27-30 score | claude $/call | gemini $/call | $ ratio |
|---|---|---|---|---|---|
| src/auth.ts SQL injection | 0.533 | 1.000 (n=28) | $0.0311 | $0.0042 | 7.4× |
| src/session.ts race conditions | 0.467 | 1.000 (n=27) | $0.0539 | $0.0072 | 7.5× |
| src/billing.ts test coverage | 0.000 | 0.000 (n=30) | $0.0382 | $0.0016 | 24× |
| src/config.ts env leaks | 0.900 | 1.000 (n=29) | $0.0632 | $0.0047 | 13× |
| src/server.ts HTTP endpoints | 0.000 | 0.000 (n=30) | $0.0090 | $0.0018 | 5.0× |
(n=27/28/29 instead of 30 on three prompts is the realistic-LLM error tail — a few requests timed out or returned 429s under the sustained load. Errored dispatches are still recorded; the cost ledger captures the partial-token charges.)
What the numbers actually say at n=30
- Gemini-2.5-flash scores 1.000 on every real security prompt at 7-13× lower cost than claude-sonnet-4-6. Not a small-sample artifact: every one of the n=27 to 30 dispatches per cell passed the rubric, with zero variance in the score, on three independent security prompts. On these prompts under this rubric, gemini-flash is the Pareto-better choice, and the conventional wisdom (“claude is the higher-quality model”) doesn’t hold for this workload at this price point. “For these prompts under this rubric” is the methodology-runner-shaped caveat that should follow every claim of this shape.
- The two Goodhart’s-law cells (billing.ts, server.ts) fail with identical 0.000 score on BOTH models. Cleanest possible confirmation that the failure mode is the rubric, not the agent. The diagnose pass we ran flagged this in risks_flagged back when we only had claude data; the cross-model n=30 evidence is now in.
- Cost gap widens with prompt complexity. The simplest prompts (src/server.ts endpoint enumeration) show 5× cost ratios; the security audits with longer responses show 7-24×. Customers running methodologies over heavier prompts capture more of gemini-flash’s cost advantage.
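The "Pareto-better" reading of the table can be checked mechanically. A small sketch using the scores and per-call costs copied from the cross-model table above; the `dominates` helper is illustrative, not part of the runner.

```python
# (prompt, claude score, claude $/call, gemini-flash score, gemini-flash $/call)
rows = [
    ("src/auth.ts SQL injection",      0.533, 0.0311, 1.000, 0.0042),
    ("src/session.ts race conditions", 0.467, 0.0539, 1.000, 0.0072),
    ("src/billing.ts test coverage",   0.000, 0.0382, 0.000, 0.0016),
    ("src/config.ts env leaks",        0.900, 0.0632, 1.000, 0.0047),
    ("src/server.ts HTTP endpoints",   0.000, 0.0090, 0.000, 0.0018),
]

def dominates(score_a, cost_a, score_b, cost_b):
    """A dominates B: no worse on score and cost, strictly better on at least one."""
    return (
        score_a >= score_b
        and cost_a <= cost_b
        and (score_a > score_b or cost_a < cost_b)
    )

results = []
for prompt, c_score, c_cost, g_score, g_cost in rows:
    gemini_wins = dominates(g_score, g_cost, c_score, c_cost)
    results.append((prompt, gemini_wins, c_cost / g_cost))
    print(f"{prompt:34s} gemini dominates: {gemini_wins}  cost ratio: {c_cost / g_cost:.1f}x")
```

On the two 0.000 cells gemini-flash dominates on cost alone, so those rows say nothing about quality; the rubric caveat above still applies.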
The transparency move worth flagging: the original Part 6 had only claude data because I hadn’t plumbed the keychain workaround through. Will caught it, the fix is one new module + four call-site swaps, and the methodology runner code that produced both columns ships in c774584 + 91eab43. That is what we mean by “the receipts are real.”
Production-relevant finding the methodology surfaced about the diagnose model itself
While running the cross-model fan-out, we also re-fired the diagnose pass through gemini to see whether it could replace claude-sonnet-4-6 as the diagnose agent (the “what-change-would-have-improved-this-cell” meta-LLM that drives the v2.11 learning loop):
- claude-sonnet-4-6: $0.019/call, 742-771 tokens out, parseable JSON every time, with the Goodhart's-law warning unprompted in risks_flagged.
- gemini-2.5-pro: $0.006/call, 2,380 tokens out, excellent prose analysis (correctly identified race conditions in src/session.ts with code-level specificity), but ignored the strict JSON schema entirely. The diagnose pipeline’s structured-output contract failed every call.
- gemini-2.5-flash: dispatch errored with empty response on every attempt against this prompt length. Not viable as a diagnose model.
The product implication: the diagnose pipeline currently requires claude as the model. Other frontier models can reason about the failure (gemini-pro’s prose was genuinely good) but they don’t reliably wrap the answer in the contract the apply pipeline needs. Defaulting --diagnose-model to claude-opus-4-7 in docs/v2.11-learning-loop.md §Q3 wasn’t arbitrary; cross-model dogfood validates it. Customers running on their own dime should pick claude for the diagnose step regardless of which model they’re evaluating.
Honest gaps (and what fills them)
- No p-values yet. The composer surfaces Welch’s t-statistic and Satterthwaite df. The proper p-value calculation needs the incomplete-beta CDF; we made a deliberate call to ship the CI-disjoint heuristic instead while the runner finds its footing. PR-5 may bring the full p computation in.
- Dev-build keychain: FIXED in v2.11 PR-12.6 (2026-05-25). The original gap (dev binary couldn’t decrypt API keys the prod desktop wrote) is closed: the methodology runner now delegates keychain-bound dispatches to the prod app bundle via ATO_CLI_PATH + a fallback chain in cli_path::resolve_ato_binary. The cross-model n=30 update above (gemini-2.5-flash, $0.56 customer / $0.11 ours) is the empirical proof. Customers running the production binary never hit the original issue.
- PR-5 dual-cost calibration + admin margin reports: SHIPPED in v2.10 (2026-05-25). ato evaluations methodology margin is the customer-facing margin transparency surface (now Pro per Part 7’s open-core boundary call); ato evaluations methodology calibrate set writes per-deployment rate overrides into ~/.ato/rate-card-override.json. The $0.05 OUR cost / $0.29 margin per run was the spec default at publish; live numbers calibrate from the actual Railway month via calibrate.
- n=10 per cell remains industry-baseline-mid, not top-bar. Same gap Part 5 called out. The runner makes n=30 trivial to fire; what’s missing is the customer-facing UI panel that turns the scored receipts into a board-meeting chart. v2.10 PR-5 closes part of that.
How to reproduce this on your own machine
brew install willnigri/ato/ato
# (or download from agentictool.ai)
# 1. Take any existing execution_logs rows you have
sqlite3 ~/.ato/local.db "SELECT COUNT(*), runtime, model
FROM execution_logs
WHERE created_at > date('now', '-7 days')
GROUP BY runtime, model;"
# 2. Define a methodology with a regex rubric
cat > my-methodology.json <<EOF
{
"slug": "weekly-quality-check",
"archetype": "regression-watch",
"variant_matrix": {
"prompts": ["(adopted)"],
"models": ["claude-sonnet-4-6"],
"conditions": ["default"],
"reps_per_cell": 30
},
"rubric": {
"kind": "regex",
"pattern": "(?i)(your|expected|markers|here)",
"case_insensitive": true
}
}
EOF
ato evaluations methodology create --config my-methodology.json --human
# 3. Adopt your existing receipts into a run
# (date -v-7d is the BSD/macOS form; with GNU date use: date -d '7 days ago' +%Y-%m-%d)
ato evaluations methodology adopt weekly-quality-check \
--since "$(date -v-7d +%Y-%m-%d)" \
--runtime claude \
--status success \
--human
# 4. Score (regex rubric = free, llm_judge = $0.005-ish per dispatch)
ato evaluations methodology score <run-id> --human
# 5. See the per-cell composition
ato evaluations methodology runs show <run-id> --human
The whole loop takes under 5 minutes against any week’s receipts. The expensive parts (running new dispatches, calling LLM judges) are explicit opt-ins per the runner’s “no surprise spend” rule.
Why we can ship this with our face on it
Part 5 retracted one of our own n=1 claims with the n=10 data. Part 6 demonstrates the runner against that same retracted corpus. Both findings get persisted into the same SQLite file the customer queries. The receipts for everything in this post live in ~/.ato/local.db — methodology id 294e0499-2729-4397-a177-986b80c57aac (regex run), 9da277ef-749a-4ad2-8fe6-c771a163809f (LLM-judge run) — queryable, reproducible, public.
If you sell an eval product, you have to use it on your own product. The runner is what we use to make decisions. That’s why it’s the Pro tier.
Real data behind this post: 157 claude-sonnet-4-6 receipts (the n=150 corpus from Part 5) + 13 gemini-2.5-flash receipts + 13 new claude-sonnet-4-6 judge dispatches fired 2026-05-25 to validate the LLM-judge path. Total incremental spend for this Part 6: $0.0712. Methodology rows + dual cost ledger in ~/.ato/local.db:methodologies, methodology_runs, methodology_run_dispatches. The code: apps/cli/src/methodology/ (~3,300 LOC for the runner + composer + rubric library) and apps/cli/src/commands/methodology.rs (CLI surface). Open-source pricing rate card at packages/ato-pricing/pricing.json — same file the cost estimator reads.