We used ATO to test ATO — Part 5. Closing the sample-size gap, and falsifying one of our own Part 4 claims with the real data.
- Part 1 — PR-1 foundation. Score: +1 upgrade, −1 regression (n=1).
- Parts 2–4 — PR-2 closes the claude false negative. PR-3 routes API providers. PR-4 refuses-with-options for parserless runtimes. Sample size: still n=1.
- Part 5 (you are here) — n=150 scaled re-run. One Part 4 claim falsified by the real data. Methodology runner spec shipped.
Parts 1–4 of this series shipped a feature, scored each PR against a bench, and admitted in writing that the sample size was below industry baseline. Part 5 is what happened when we took that critique seriously, fired 150 dispatches in a structured n=10/cell sweep, and computed confidence intervals on every claim. Three of the directional findings from Parts 1–4 hold up. A fourth is falsified by the data. That's not embarrassing; that's the entire point of an eval product. The methodology runner that ran in this post shipped as v2.10/v2.11 (Parts 6 and 7 are the build logs).
The promise from Parts 1–4 was: bench → impl → score, with receipts
What we actually had at the end of Part 4 was a series of n=1 vs n=1 comparisons. Useful for product-decision conviction; not enough for a customer-facing claim. The honest reading of where we sat:
- What we'd done well: the methodology shape (cold → impl → score) is rare and right. Most startups don't run any structured evals on their own product decisions.
- What we hadn't done: any single cell at n ≥ 30. Industry baselines (Promptfoo: 10–100 cases × 3–5 models per eval; Braintrust: median 50–300 examples; HumanEval: 164 problems) all run at orders of magnitude more replications per cell than our n=1 was capturing. The Part 4 finding — “gemini hallucinated MORE under soft hints, +109%” — was specifically one data point compared to one data point. We labeled it a “counter-intuitive finding worth the blog” in Part 4. At n=1, anything is.
This is the gap the Pro Methodology Runner spec closes for customers. But before we ship the customer-facing version, we owe the v2.9 series a real-data follow-up.
The n=150 design
- 5 distinct prompts spanning the kind of work the original cold control simulated:
  - P1: SQL-injection review on src/auth.ts
  - P2: race-condition audit on src/session.ts
  - P3: env-var-leak audit on src/config.ts
  - P4: test-coverage check on src/billing.ts
  - P5: HTTP-endpoint enumeration on src/server.ts
- 3 conditions: cold (no grounding), soft (--mode-override soft --require-tools Bash), strict (--mode-override strict --require-tools Bash)
- n=10 per cell: 5 × 3 × 10 = 150 dispatches total
- Runtime: claude. The original Part 5 run used claude-only because Parts 2–4’s dev-build / prod-keychain ACL mismatch blocked API-provider dispatches from the dev binary. v2.11 PR-12.6 (2026-05-25) shipped the ATO_CLI_PATH env override that closes that gap; the gemini n=30 cross-model run lives in the “Cross-model footer” below and in the full Part 7.
Cost in practice (real receipts, real $): $6.22 for the 150 dispatches across 12 minutes wall-clock with batches of 5 in parallel. That's an order of magnitude more than the n=1 version would have cost, and roughly a fifth of what a customer running the same methodology weekly at n=30 would pay (~$33). On the eval-cost ladder we’re reaching for, n=10 is mid-tier: enough to compute defensible confidence intervals on every cell.
Finding 1 — Cold mode produces literally zero tool calls. Always.
n=67 dispatches with no grounding flags. Across all 5 prompts. Mean tool calls per dispatch: 0.00 (sd: 0.00). Confidence interval is zero-width because there's no variance.
This is a real product finding, not a calibration artifact. Without the soft-mode prompt prepend listing the mandatory rules as expected behavior, claude does not reach for its native tools on these prompts. It responds from priors. The grounded-mode prepend isn't a recommendation — it's the load-bearing signal that engages the runtime's tool use.
| Condition | n | Mean tool calls/dispatch (cross-prompt avg) | Per-prompt SD range (5 prompts) |
|---|---|---|---|
| cold | 67 | 0.00 | 0.00 |
| soft | 50 | 2.66 | 0.5–1.25 tool calls |
| strict | 43 | 2.52 | 0.7–1.1 tool calls |
Note: these numbers are tool-call counts per dispatch, not rubric pass rates. The 0.7–1.1 SD range under “strict” is how much the per-prompt mean tool-call count varies across the 5 prompts. The 0–1 rubric pass-rate scores (0.533, 0.467, 0.900, etc.) live in the Cross-model footer below and in Part 7, computed on the same prompts but with the n=30 cross-model corpus.
This validates the Part 1 framing (“grounded mode is the layer that makes ‘every AI follows your rules’ a checked invariant”) with real data. Without grounding, claude is text-only on prompts that should require tool use.
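The zero-width interval on the cold row follows directly from how a confidence interval on a mean is built. Here is a minimal Python sketch, assuming a standard 95% Student's t interval; the post does not say which interval the analysis script uses. The function name, the small critical-value table, and the soft-cell sample are all illustrative, not real receipts.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {4: 2.776, 9: 2.262, 29: 2.045}

def mean_ci95(samples):
    """Return (mean, lower, upper) for a 95% t-interval on the sample mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)  # sample SD, n - 1 denominator
    half_width = T_CRIT_95[n - 1] * sd / math.sqrt(n)
    return mean, mean - half_width, mean + half_width

# Cold cell: every dispatch made zero tool calls, so SD is 0 and the
# interval collapses to a single point.
cold = [0] * 10

# Hypothetical soft cell, invented for illustration only.
soft_example = [2, 3, 3, 2, 4, 3, 2, 3, 1, 3]

print(mean_ci95(cold))
print(mean_ci95(soft_example))
```

With no variance in the sample the half-width is exactly zero, which is why the cold row reports a point rather than a range.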
Finding 2 — The verdict ladder works as designed
| Condition | n | Verdict distribution |
|---|---|---|
| cold | 57 (eval window) | 100% no-grounding — back-compat preserved ✓ |
| soft | 50 | 96% advisory, 4% no-grounding (control rows) — soft never produces compliant by design |
| strict | 43 | 51% compliant · 47% violation · 2% no-grounding |
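The strict-row percentages are estimates from 43 dispatches, so each carries sampling uncertainty. The sketch below uses the Wilson score interval, a standard choice for proportions at small n. It is an assumption that this matches the interval the analysis used, and the violation count of 20 is back-computed from the 47% in the table rather than read from the receipts.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed count: 47% of 43 strict dispatches is about 20 violations.
low, high = wilson_interval(20, 43)
print(f"strict violation rate, 95% interval: {low:.2f} to {high:.2f}")
```

Under those assumptions the interval runs from roughly 33% to 61%: wide, but well clear of zero, so the over-rejection is unlikely to be sampling noise.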
The 47% violation rate is a real product finding
We required claude to call the Bash tool. Claude used Bash about half the time. The other half, it reached for Read, Glob, or Grep: equally valid choices, but they didn’t match the rule's literal name. Rule-name-exact matching over-rejects across diverse prompts.
This is exactly what the “tool-name alias table” follow-on flagged in Part 2 was meant to address. Part 2 noted the issue from a single observation (“first attempt required Read; claude used Bash — verdict stayed advisory”). At n=150 the issue is now quantified: nearly half of strict-mode dispatches misfire due to name-matching, on prompts where the agent is actually grounded but used a synonymous tool. The follow-on slice that adds tool-name aliases isn't “nice to have” — it's blocking strict mode from being usable across realistic prompt distributions.
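To make the name-matching problem concrete, here is a minimal Python sketch of alias-aware verdicts. The alias grouping, the function name, and the two-value verdict are hypothetical illustrations. They are not ATO's shipped alias table or its full verdict ladder.

```python
# Hypothetical alias table: tools treated as interchangeable for grounding.
# This grouping is an illustration, not ATO's shipped configuration.
TOOL_ALIASES = {
    "Bash": {"Bash", "Read", "Glob", "Grep"},
}

def strict_verdict(required_tool, tools_used, aliases=None):
    """Return 'compliant' if any used tool satisfies the rule, else 'violation'."""
    accepted = {required_tool}
    if aliases:
        accepted |= aliases.get(required_tool, set())
    return "compliant" if accepted & set(tools_used) else "violation"

# Name-exact matching rejects a dispatch that read the file with Read and Grep.
print(strict_verdict("Bash", ["Read", "Grep"]))
# Alias-aware matching accepts the same dispatch.
print(strict_verdict("Bash", ["Read", "Grep"], TOOL_ALIASES))
```

A dispatch that used no tools at all still comes back as a violation under either policy, so aliasing relaxes the tool name without relaxing the grounding requirement.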
Finding 3 — The Part 4 “soft mode amplifies hallucination” claim is falsified
Part 4 reported: “gemini hallucinated MORE under soft-mode hints (5454 → 11378 chars, +109%). The mandatory-rule prompt note acted as scaffolding for fake findings.” One sentence, one data point, marketed as a research finding.
What the n=150 data actually shows when we compare cold vs soft response length per prompt (on claude here; the gemini cross-model n=30 run that became possible after v2.11 PR-12.6 fixed the keychain delegation lives in the Cross-model footer below):
| Prompt | Cold mean chars (n=10) | Soft mean chars (n=10) | Δ |
|---|---|---|---|
| P1 (auth.ts SQL injection) | 586 | 716 | +22% |
| P2 (race conditions) | 1097 | 496 | −55% |
| P3 (env-var leaks) | 1133 | 1386 | +22% |
| P4 (test coverage) | 381 | 462 | +21% |
Three of four prompts show a small positive amplification (+21% to +22%). The fourth shows a 55% reduction. Net direction across prompts: directionally positive but **prompt-condition interaction dominates**.
The +109% Part 4 reported wasn't replicated at scale because it couldn't have been. n=1 vs n=1 between different models on a single prompt is not a finding. At n=10 per cell across 5 prompts, the per-prompt variance dwarfs the average effect. The Part 4 claim is hereby retracted as “observed in one specific cell, did not generalize across prompts.”
This is exactly what a customer running our Pro methodology runner needs to be able to do: take a striking finding from one run and ask, “does it hold up across N replications and M prompts?” The answer is often no. The infrastructure to make that ask cheap and routine is the product.
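The per-prompt deltas can be recomputed from the cold and soft means in the table above. This short Python sketch does that and adds two summary numbers, the unweighted mean of the deltas and their spread. Both summaries are illustrative additions, not figures from the original analysis.

```python
# Mean response length in characters per prompt, copied from the table above.
cold_chars = {"P1": 586, "P2": 1097, "P3": 1133, "P4": 381}
soft_chars = {"P1": 716, "P2": 496, "P3": 1386, "P4": 462}

def pct_delta(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

deltas = {p: pct_delta(cold_chars[p], soft_chars[p]) for p in cold_chars}
for prompt, delta in deltas.items():
    print(f"{prompt}: {delta:+.0f}%")

mean_delta = sum(deltas.values()) / len(deltas)
spread = max(deltas.values()) - min(deltas.values())
print(f"mean of deltas: {mean_delta:+.1f}%, spread: {spread:.0f} points")
```

The mean of the four deltas is only a few percent, while the gap between the largest and smallest delta exceeds 70 percentage points. That is what prompt-condition interaction dominating looks like in numbers.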
Finding 4 — Grounded dispatches cost ~14× more than cold, on ~24× the output tokens
This is the load-bearing economic finding for the Pro Methodology Runner spec:
| Condition | n | Total $ | Mean $/dispatch | tokens_out (avg) |
|---|---|---|---|---|
| cold | 67 | $0.29 | $0.0046 | ~169 |
| soft | 50 | $3.09 | $0.0618 | ~4,117 |
| strict | 43 | $2.82 | $0.0656 | ~4,372 |
Grounded dispatches generate roughly 24–26× more output tokens than cold (claude consumes the file contents it read via tools and reasons over them). Per call, that works out to roughly 13–14× the cost ($0.0618 and $0.0656 vs $0.0046).
For the methodology runner’s open-source pricing rate card, this means the pre-run cost estimate has to use grounded-mode token assumptions, not cold-mode ones. A customer running a 30-rep model-ladder methodology against 4 models × 5 prompts × 3 conditions (30 × 4 × 5 × 3 = 1,800 dispatches) would spend ~$110 at claude grounded rates (matching the spec’s “high-confidence” sample-size guidance), not the ~$8 a cold-mode estimate would imply. Transparency about that gap is what the pre-run estimate exists to deliver.
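A pre-run estimate of this kind is simple arithmetic. The Python sketch below uses the per-dispatch means from the Finding 4 table. The function name and signature are hypothetical, and applying one claude mean to every dispatch (all models, all conditions) is a deliberate simplification, so treat both totals as rough.

```python
# Mean cost per dispatch on claude, taken from the Finding 4 table.
MEAN_COST_USD = {"cold": 0.0046, "soft": 0.0618, "strict": 0.0656}

def estimate_cost(models, prompts, conditions, reps, assume="soft"):
    """Pre-run estimate: total dispatches times an assumed per-dispatch cost."""
    dispatches = models * prompts * conditions * reps
    return dispatches, dispatches * MEAN_COST_USD[assume]

n, grounded = estimate_cost(models=4, prompts=5, conditions=3, reps=30)
_, cold_only = estimate_cost(4, 5, 3, 30, assume="cold")
print(f"{n} dispatches: ${grounded:.2f} grounded vs ${cold_only:.2f} cold-mode")
```

The gap between the two totals is the surprise a customer would see on their bill if the estimator defaulted to cold-mode assumptions.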
What this gets us to vs the industry baseline
| Eval category | Parts 1–4 (n=1) | Part 5 (n=10/cell, 150 total) | Industry baseline (target) |
|---|---|---|---|
| Sample size per cell | 1 | 10 | 30–100 |
| Cross-prompt generalization | 1 prompt | 5 prompts | 10–100 prompts |
| Confidence intervals | none | yes (95% CI per cell) | yes (+ significance tests) |
| Cost decomposition | per-dispatch only | by condition with totals | full Pareto frontier with judge cost |
| Regression detection | none | manual replay only | scheduled weekly + alerts |
| Grade (industry rubric) | D | B− | A |
From D to B-minus in one focused 12-minute eval, $6.22 of API spend. The A grade comes when:
- The Methodology Runner ships (v2.10 PR-1) so customers can fan out at n=30+ without writing bash scripts
- Significance tests are built into the composer (current data is reportable; significance gating is automated)
- Scheduled regression-watch runs land (the v2.10 PR-4 archetype)
- Cross-model laddering (claude-haiku vs sonnet vs opus, gemini 2.0 vs 2.5, gpt-4o vs gpt-4.1) becomes a one-flag run instead of three days of careful key juggling
What this means for the Pro product spec
Three concrete changes to the methodology runner spec the Part 5 data forces:
- Pre-run cost estimate must default to grounded-mode token assumptions (~4,000 tokens out per dispatch on claude when the agent uses tools), not cold-mode (~170). The 24× multiplier between cold and grounded is real and must show up in the estimate or the customer will be surprised by their bill.
- The tool-name alias table follow-on is now blocking, not nice-to-have. At n=150 across 5 prompts, name-exact matching over-rejected 47% of strict-mode dispatches that were actually grounded. Aliasing is the difference between strict mode being usable in production and being a footgun.
- Cross-prompt heterogeneity is so high that single-prompt findings cannot be reported. The methodology runner’s composer must require ≥ 3 distinct prompts per cell before it’s allowed to surface a directional finding. Anything based on a single prompt is, by Part 5’s own evidence, untrustworthy.
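The third change can be sketched as a small gate in the composer. The version below is a hypothetical illustration in Python. The threshold of 3 prompts comes from the text above, but the function name and the extra rule that all per-prompt deltas must share a sign are assumptions, not the shipped composer logic.

```python
MIN_PROMPTS = 3  # threshold named in the spec change above

def directional_finding(deltas_by_prompt):
    """Return 'positive', 'negative', or None when no claim should surface.

    Hypothetical rule: require at least MIN_PROMPTS distinct prompts and
    require every per-prompt delta to point the same way.
    """
    if len(deltas_by_prompt) < MIN_PROMPTS:
        return None
    values = list(deltas_by_prompt.values())
    if all(v > 0 for v in values):
        return "positive"
    if all(v < 0 for v in values):
        return "negative"
    return None

# A single-prompt result like the Part 4 callout is never surfaced.
print(directional_finding({"auth.ts": 109.0}))
# The Part 5 per-prompt deltas disagree in sign, so nothing is surfaced either.
print(directional_finding({"P1": 22.0, "P2": -55.0, "P3": 22.0, "P4": 21.0}))
```

Under this rule both the original +109% cell and the mixed-sign Part 5 deltas are held back, which matches how the post treats them.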
How to reproduce this on your own machine
```bash
brew install willnigri/ato/ato

# Pick 5 prompts that exercise your real work
# Run each at n=10 per condition (cold / soft / strict)
# Total: 5 × 3 × 10 = 150 dispatches ≈ $6 ≈ 12 minutes
prompts=("your prompt 1" "your prompt 2")  # extend this array to your 5 prompts

for cond in cold soft strict; do
  for p in "${prompts[@]}"; do
    for r in {1..10}; do
      case "$cond" in
        cold)
          ato dispatch claude "$p" ;;
        soft)
          ato dispatch claude "$p" --mode-override soft --require-tools Bash ;;
        strict)
          ato dispatch claude "$p" --mode-override strict --require-tools Bash ;;
      esac
    done
  done
done

# Then aggregate: one row per condition for today's successful claude dispatches
sqlite3 ~/.ato/local.db <<'SQL'
SELECT
  CASE
    WHEN grounding_overrides IS NULL THEN 'cold'
    WHEN grounding_overrides LIKE '%"effective":"soft"%' THEN 'soft'
    WHEN grounding_overrides LIKE '%"effective":"strict"%' THEN 'strict'
  END AS cond,
  COUNT(*) AS n,
  AVG(tool_calls_count) AS mean_tool_calls,
  AVG(LENGTH(response)) AS mean_chars,
  AVG(cost_usd_estimated) AS mean_cost,
  AVG(duration_ms) AS mean_duration_ms
FROM execution_logs
WHERE runtime='claude' AND status='success'
  AND date(created_at) = date('now')
GROUP BY cond;
SQL
```
This is the methodology runner’s minimum viable shape, hand-rolled in bash. The Pro version shipped in v2.10 PR-3 (2026-05-25) + v2.11 PR-12.x: variant matrix expansion, LLM-judge rubric, cross-model laddering (PR-13), confidence intervals computed into the receipt, scheduled regression-watch (PR-7), --apply with lineage tracking (PR-12.4), auto-extension on holdout regression (PR-15) — all with the dual cost accounting (your LLM invoice + our compute) the rate card publishes. Part 6 is the methodology runner build log.
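The aggregate query groups by condition only and reports means, while the findings above depend on per-prompt cells and their spread. SQLite has no built-in standard deviation, so a Python pass is one way to get both. This is a sketch under assumptions: the table and column names are taken from the query above, except `prompt`, which is a hypothetical column name for wherever the dispatch prompt is stored.

```python
import sqlite3
import statistics
from collections import defaultdict

def condition_of(overrides):
    """Mirror the CASE expression in the SQL: NULL means cold."""
    if overrides is None:
        return "cold"
    for mode in ("soft", "strict"):
        if f'"effective":"{mode}"' in overrides:
            return mode
    return None  # unrecognized override payload; skipped by the caller

def per_cell_stats(conn):
    """Return {(condition, prompt): (n, mean, sd)} for tool calls per dispatch."""
    cells = defaultdict(list)
    query = (
        "SELECT grounding_overrides, prompt, tool_calls_count "  # `prompt` is assumed
        "FROM execution_logs "
        "WHERE runtime = 'claude' AND status = 'success'"
    )
    for overrides, prompt, tool_calls in conn.execute(query):
        cond = condition_of(overrides)
        if cond is not None:
            cells[(cond, prompt)].append(tool_calls)
    return {
        cell: (
            len(vals),
            statistics.fmean(vals),
            statistics.stdev(vals) if len(vals) > 1 else 0.0,
        )
        for cell, vals in cells.items()
    }

# Usage, assuming the database exists at the path used in the shell snippet:
#   import os
#   conn = sqlite3.connect(os.path.expanduser("~/.ato/local.db"))
#   for cell, (n, mean, sd) in sorted(per_cell_stats(conn).items()):
#       print(cell, n, round(mean, 2), round(sd, 2))
```

Keeping the per-cell samples separate is what makes per-prompt SD ranges and per-cell confidence intervals possible later.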
Update 2026-05-25 PM — the cross-model n=30 data
The "single LLM tested" gap below has been closed. After v2.11 PR-12.6 shipped the ATO_CLI_PATH env override that lets a dev binary delegate to the prod app-bundle’s keychain, we re-fired the same five Part 5 prompts through gemini-2.5-flash at n=30 per prompt. Cost: $0.56 customer / $0.11 ours / 84 minutes via the free ato evaluations methodology run primitive. Same rubric (regex match on security keywords) as the original claude run.
| Prompt | claude-sonnet-4-6 n=30 score | gemini-2.5-flash n=27-30 score | $/call claude | $/call gemini |
|---|---|---|---|---|
| src/auth.ts SQL injection | 0.533 | 1.000 | $0.0311 | $0.0042 |
| src/session.ts race conditions | 0.467 | 1.000 | $0.0539 | $0.0072 |
| src/billing.ts test coverage | 0.000 | 0.000 | $0.0382 | $0.0016 |
| src/config.ts env leaks | 0.900 | 1.000 | $0.0632 | $0.0047 |
| src/server.ts HTTP endpoints | 0.000 | 0.000 | $0.0090 | $0.0018 |
Two findings that Part 5 couldn’t produce (and that compose nicely on top of its grounded-mode story):
- Gemini-2.5-flash scored 1.000 on every real security prompt at 7-13× lower cost than claude-sonnet-4-6, with zero variance in score across n=27-30 dispatches per cell. The conventional “claude is the higher-quality model” wisdom doesn’t hold for this workload at this price point. That holds for these prompts under this rubric, which is the methodology-runner-shaped caveat that should follow every claim of this shape.
- The two Goodhart’s-law cells (billing.ts, server.ts) fail on BOTH models with identical 0.000 score. Cleanest possible confirmation that the failure mode is the rubric, not the agent — these prompts aren’t security questions, so neither agent volunteers security keywords. Same rubric mismatch the v2.9 + v2.10 work tagged as a known limitation; cross-model evidence is now in.
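The rubric mismatch is easy to reproduce in miniature. The post describes the rubric only as a regex match on security keywords, so the keyword list, function names, and sample responses in this Python sketch are hypothetical stand-ins.

```python
import re

# Hypothetical keyword list; the real rubric's pattern is not given in the post.
SECURITY_KEYWORDS = re.compile(
    r"\b(sql injection|race condition|secret|credential|vulnerab\w*|sanitiz\w*)\b",
    re.IGNORECASE,
)

def rubric_score(response):
    """1.0 if the response mentions any security keyword, else 0.0."""
    return 1.0 if SECURITY_KEYWORDS.search(response) else 0.0

def pass_rate(responses):
    """Mean rubric score over a cell's responses."""
    return sum(rubric_score(r) for r in responses) / len(responses)

# Invented responses for illustration only.
security_answer = "Line 42 builds the query by concatenation: SQL injection risk."
coverage_answer = "billing.ts has 14 tests; refund paths are not covered."

print(rubric_score(security_answer), rubric_score(coverage_answer))
```

A correct, useful answer to the test-coverage prompt scores 0.0 here simply because it has no reason to mention security terms. That is the same shape as the 0.000 cells on both models.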
This whole follow-up ran via the FREE ato evaluations methodology run primitive — the customer’s API keys pay; no ATO cloud roundtrip. Pro & Team tier adds the codified automation on top (scheduled re-runs, learning-loop diagnose, auto-revert on regression) but the cross-model receipts above are everyone’s to produce. See Part 6 for the methodology runner architecture + the same data with deeper cuts.
Honest gaps remaining
- Single LLM tested in this Part 5 (now closed — see update above). The original Part 5 ran all 150 dispatches against claude only. The 2026-05-25 PM update fires the same prompts against gemini-2.5-flash for proper cross-model comparison. The grounded-mode AXIS (cold / soft / strict) is still claude-only because grounded mode is implemented at the runtime level for the claude CLI; replicating it across gemini's tool-use surface is a separate experiment.
- n=10/cell is industry-baseline-mid, not top-bar. Real benchmarks (HumanEval at 164 problems, MMLU at 15,908 questions, Patronus RAG at ~1k) run one to three orders of magnitude larger than a 10-dispatch cell. We're now defensible at the “publishable internal eval” tier; the A-grade tier is what the runner enables at scale.
- Single-runtime cost generalization is loose. The 24× multiplier between cold and grounded is claude-specific. Codex and the API providers will produce different multipliers (their tool-use cost characteristics differ). The runner’s cost estimate has to be per-runtime, not per-condition-across-runtimes.
Why we put a falsified claim in the same series as the feature it was attached to
The honest answer: it would be embarrassing to publish Parts 2–4 with the +109% claim, then quietly delete it later. The methodology runner that v2.10 PR-1 implements will run this loop for customers. Part of that loop, by design, is finding out that yesterday’s striking finding doesn’t replicate at scale. If we hide our own n=1 falsifications when the n=10 data shows up, we’ve already disqualified ourselves from selling the runner to anyone serious.
So the falsified claim stays in Parts 2–4, with this Part 5 documenting the retraction. The receipt for the retraction is the receipt for everything else: ~/.ato/local.db, queryable, reproducible, public.
Real data behind this post: 150 claude dispatches fired 2026-05-24 between 14:00:38 and 14:12:36 (local time, BRT-3). Total spend: $6.22 (your dispatch costs would be similar). Receipts in ~/.ato/local.db on the author's machine. The scaled-eval bash script + python analysis are at /tmp/grounded-mode-receipts/scaled-eval.sh and /tmp/grounded-mode-receipts/analyze.py; they will be cleaned up into scripts/eval/ in the OSS repo when the methodology runner ships in v2.10 PR-1. Falsified claim being retracted: Parts 2–4’s research-finding callout that “gemini hallucinated +109% under soft hints.” The +109% was the single auth.ts cell; across the four prompts tabulated at n=10 per cell, the cold-to-soft deltas range from −55% to +22%.