2026-05-25 · Build log · Part 7 of a v2.9/v2.10/v2.11 series · 8 min read

We used ATO to test ATO — Part 7. Gemini-flash beat sonnet on the same 5 prompts at 7-13× lower cost.

v2.9/v2.10/v2.11 build log — series.

Part 1 — v2.9 PR-1: grounded-mode foundation.
Parts 2–4 — v2.9 PR-2/3/4.
Part 5 — n=150 scaled re-run; one Part 4 claim falsified.
Part 6 — v2.10 methodology runner ships.
Part 7 (you are here) — cross-model n=30 on the same 5 prompts. What it says about which model to pick.

Two months of receipts have been telling us the same thing: the model that wins on benchmarks isn’t always the model that wins on YOUR work. Part 7 is what happens when you take that seriously, fire 150 fresh gemini dispatches to set against the 150 claude dispatches already on file (two models, industry-baseline n=30 per cell), and read the receipts. The headline finding is uncomfortable and worth fifteen minutes of your time if you spend more than $50/mo on LLM calls.

The setup — what we actually ran

Five prompts that look like real engineering work: three security audits on hypothetical source files, plus two non-security questions in the same shape. We’ve been running these for the last few weeks under the v2.9 grounded-mode build log; they survived the n=150 statistical rinse Part 5 documented.

P1: "Review src/auth.ts in this repository for SQL injection vulnerabilities."
P2: "Find any race conditions in src/session.ts. Walk the code."
P3: "Check whether src/billing.ts has test coverage. Look at the tests folder."
P4: "Audit src/config.ts for environment variable leaks or hardcoded secrets."
P5: "List the public HTTP endpoints exposed by src/server.ts."

Two models, same prompts: claude-sonnet-4-6 (Anthropic) and gemini-2.5-flash (Google). One rubric: "did the response mention at least one security-finding keyword (vulnerability / injection / race / leak / secret / exposure / hardcoded / CVE)?" Score: 1.0 if yes, 0.0 if no. Industry-baseline n=30 dispatches per prompt-model cell. Total: 150 fresh gemini dispatches (the claude side was already in our DB from the v2.9 work).
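The rubric above can be sketched as a small shell function. This is a minimal sketch, assuming a case-insensitive substring match on the eight keywords; the name `score_response` is ours for illustration, and the study's exact pattern may differ (for example in how it treats word boundaries).

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints 1.0 if the response text mentions at least one
# of the rubric's security-finding keywords, else 0.0.
# Assumption: case-insensitive substring match; the study's regex may differ.
score_response() {
  local pattern='vulnerability|injection|race|leak|secret|exposure|hardcoded|CVE'
  if printf '%s' "$1" | grep -qiE "$pattern"; then
    echo "1.0"
  else
    echo "0.0"
  fi
}

# Made-up responses, for illustration only.
score_response "The query builder is open to SQL injection."
score_response "Tests exist under tests/billing.test.ts."
```

Substring matching is deliberately loose: "race" also matches "trace". Adding `-w` to the grep call would restrict matches to whole words if that matters for your prompts.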

Total spend to produce this entire blog post’s data: $0.56 on the gemini side via the customer’s own API key. The claude side cost ~$6.20 when we ran it originally. Receipts are real and reproducible — the exact bash to replay this on your machine is at the bottom of the post.

The results

Prompt	claude n=30 score	gemini n=27-30 score	claude $/call	gemini $/call	$ ratio
P1 src/auth.ts SQL injection	0.533	1.000	$0.0311	$0.0042	7.4×
P2 src/session.ts race conditions	0.467	1.000	$0.0539	$0.0072	7.5×
P3 src/billing.ts test coverage	0.000	0.000	$0.0382	$0.0016	24×
P4 src/config.ts env leaks	0.900	1.000	$0.0632	$0.0047	13×
P5 src/server.ts HTTP endpoints	0.000	0.000	$0.0090	$0.0018	5.0×

Three things to read out of this. Worth sitting with each.

Finding 1: gemini-flash dominated on the security prompts at a fraction of the cost

On the three prompts that actually are security questions (P1, P2, P4), gemini-flash scored a perfect 1.000 across n=27-30 dispatches with zero variance. Every single dispatch mentioned at least one security keyword. Claude scored 0.467 to 0.900 on the same prompts — a meaningful chunk of dispatches drifted off-topic, refused, or gave generic answers that didn’t name the failure mode.

And it did this at 7-13× lower per-call cost: 7.4× on P1, 7.5× on P2, and 13× on P4. For a customer running 100 security reviews a day, the table’s per-call costs work out to roughly $5 versus $0.50 per day, or about $1,600/year saved for the same or better score on this rubric.
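The daily and annual figures can be checked with a few lines of awk. This sketch assumes the 100 daily reviews are spread evenly across P1, P2, and P4 and priced at the per-call costs in the results table; the function name `annual_saving` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: calls are split evenly over P1, P2, P4, priced at the per-call
# costs from the results table. Usage: annual_saving <calls_per_day>
annual_saving() {
  awk -v calls="$1" 'BEGIN {
    claude = (0.0311 + 0.0539 + 0.0632) / 3   # mean claude $/call
    gemini = (0.0042 + 0.0072 + 0.0047) / 3   # mean gemini $/call
    printf "claude_per_day=%.2f gemini_per_day=%.2f annual_saving=%.0f\n",
           claude * calls, gemini * calls, (claude - gemini) * calls * 365
  }'
}

annual_saving 100
```

A different prompt mix shifts the result: a workload made up entirely of P4-style audits sits at the 13× end of the range, one made up of P1-style reviews at the 7.4× end.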

This is the cost-quality Pareto frontier the methodology runner exists to surface. The conventional “claude is the higher-quality model” intuition isn’t wrong in general; it’s built on broader benchmarks. But your workload isn’t the benchmark. On THIS rubric, on THESE prompts, gemini-flash is the Pareto-better choice: cheaper on every prompt and never lower-scoring. The score gap is wide on P1 and P2 and narrow on P4, where 0.900 vs 1.000 is a three-dispatch difference at n=30.
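To see how firm each score gap is, a Welch t statistic can be computed by hand from the hit counts. This is a sketch of the textbook formula applied to 0/1 scores, not necessarily how ATO computes it; `welch_t` is a hypothetical name, and gemini's n is assumed to be 30 here even though some cells came in lower.

```shell
#!/usr/bin/env bash
# Welch's t statistic for two cells of 0/1 scores, from hit counts.
# Usage: welch_t <hits_a> <n_a> <hits_b> <n_b>
welch_t() {
  awk -v ka="$1" -v na="$2" -v kb="$3" -v nb="$4" 'BEGIN {
    ma = ka / na; mb = kb / nb
    # Sample variance of a 0/1 variable: p(1-p) * n / (n-1)
    va = ma * (1 - ma) * na / (na - 1)
    vb = mb * (1 - mb) * nb / (nb - 1)
    se = sqrt(va / na + vb / nb)
    if (se == 0) { print "t=undefined (both cells have zero variance)"; exit }
    printf "t=%.2f\n", (mb - ma) / se
  }'
}

# P1: claude 0.533 at n=30 is 16 hits; gemini 1.000 (n=30 assumed).
welch_t 16 30 30 30
# P4: claude 0.900 at n=30 is 27 hits; gemini 1.000 (n=30 assumed).
welch_t 27 30 30 30
```

The two calls print t=5.04 and t=1.80. Because the gemini cells have zero variance, the Welch degrees of freedom reduce to 29, where the two-sided 5% critical value is about 2.05. P1 clears it comfortably and P4 does not. A t-test on binary scores is itself an approximation, so treat this as a sanity check.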

Finding 2: when the rubric is wrong, the agent isn’t the problem

Prompts P3 (“does src/billing.ts have test coverage?”) and P5 (“list HTTP endpoints in src/server.ts”) scored a perfect 0.000 on BOTH models. Not because both models are bad at these questions — they’re competently answering whether tests exist and which endpoints are exposed. They just aren’t volunteering the security-keyword vocabulary the rubric expects, because the question isn’t a security question.

This is Goodhart’s law in evals: once a measure becomes the target, it stops capturing what you care about, and optimizing for it makes things worse. Two practical implications:

The cross-model evidence is the proof. If only claude had scored 0.000 on these prompts, we’d be tempted to ship a system-prompt patch that forced security keywords into every response. With gemini ALSO scoring 0.000, we have independent confirmation the failure mode lives in the rubric, not the agent. Cross-model evaluation is the strongest anti-Goodhart defense you can build, and it’s the second thing ATO makes one-line cheap.
Splitting your rubric by prompt class beats one global rubric. The security regex should apply to P1/P2/P4 only. P3 wants a coverage-percent regex. P5 wants an endpoint-shape regex. The methodology runner has rubric-per-prompt support; we just hadn’t configured it for this study.
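A per-prompt-class rubric can be sketched by hand as a lookup from prompt id to pattern. The security pattern reuses the study's keywords; the coverage and endpoint patterns are illustrative guesses of ours, and the function names are hypothetical. None of this is the methodology runner's own config format.

```shell
#!/usr/bin/env bash
# Hypothetical per-prompt-class rubric lookup.
# Assumption: the P3 and P5 patterns are illustrative, not from the study.
rubric_for() {
  case "$1" in
    P1|P2|P4) echo 'vulnerability|injection|race|leak|secret|exposure|hardcoded|CVE' ;;
    P3)       echo '[0-9]+(\.[0-9]+)?[[:space:]]*%' ;;            # a coverage percentage
    P5)       echo '(GET|POST|PUT|PATCH|DELETE)[[:space:]]+/' ;;  # an HTTP method plus path
    *)        return 1 ;;
  esac
}

# Prints 1.0 when the response matches its prompt's rubric, else 0.0.
score_with_class() {
  local pattern
  pattern=$(rubric_for "$1") || { echo "unknown prompt id: $1" >&2; return 1; }
  if printf '%s' "$2" | grep -qiE "$pattern"; then echo "1.0"; else echo "0.0"; fi
}

# Made-up responses, for illustration only.
score_with_class P3 "Line coverage for src/billing.ts is 82.5%."
score_with_class P5 "GET /health and POST /login are public."
```

With class-specific patterns, a competent answer to P3 or P5 can score 1.0 without having to mention security vocabulary it has no reason to use.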

Finding 3: the cost gap widens with response complexity

P5 (simple endpoint listing, short responses) shows the smallest cost ratio: 5×. The security audits (long, detailed responses) land at 7-13×. P3 (test coverage check, medium responses) is the outlier at 24×, so the ratio does not track response length alone. Our working explanation: gemini-flash’s price advantage over sonnet is smaller on output tokens than on input tokens, so prompts with rich context (codebase walkthroughs, long file audits) widen the gap.

This generalizes: if your workload uses long prompts with detailed context, you’ll see more of gemini’s cost advantage. If your workload is short Q&A, the gap narrows. Worth quantifying on your own data before you pick.
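To quantify the effect on your own workload, the blended cost ratio can be computed from token counts and per-token prices. The prices below are placeholders chosen only to show the shape of the effect (model B is 10× cheaper on input and 2× cheaper on output); substitute the current rate cards for the models you are comparing. `cost_ratio` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Cost ratio between two models for a given token mix.
# All prices are PLACEHOLDERS (USD per million tokens), not real rate cards.
# Usage: cost_ratio <input_tokens> <output_tokens>
cost_ratio() {
  awk -v tin="$1" -v tout="$2" 'BEGIN {
    a_in = 10.0; a_out = 20.0   # placeholder prices, model A
    b_in = 1.0;  b_out = 10.0   # placeholder prices, model B
    a = tin * a_in + tout * a_out
    b = tin * b_in + tout * b_out
    printf "%.1fx\n", a / b
  }'
}

cost_ratio 200 2000     # short prompt, long response
cost_ratio 20000 2000   # long prompt, same response length
```

With these placeholder prices the ratio moves from 2.1x to 6.0x as the prompt grows, because the input side, where the price gap is larger, comes to dominate the bill.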

What this means for your stack

Three concrete moves:

Don’t default to the model with the highest published benchmark score. Default to the one that wins on YOUR five most-used prompts at industry-baseline n=30. With ATO, the gemini-flash side of this study took under 90 minutes and $0.56; the claude-sonnet side cost ~$6.20.
Run your evals cross-model. If a finding only shows up on one model, you can’t separate “the agent is wrong” from “the rubric is wrong.” If both models show the same failure, the rubric is wrong.
Re-run the test when models update. The findings above are dated 2026-05-25. Frontier models ship monthly; Pareto frontiers shift. Download ATO + schedule the same methodology to re-run weekly; alert on score drift. The methodology runner’s regression-watch archetype exists for exactly this.
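The drift alert in the last move can be approximated by hand. This is a hand-rolled stand-in, not the regression-watch archetype itself: `check_drift` is a hypothetical helper, the 0.10 default tolerance is an arbitrary assumption, and scheduling (cron or otherwise) is left to you.

```shell
#!/usr/bin/env bash
# Hypothetical drift check: compare this run's cell mean to a stored baseline.
# Usage: check_drift <baseline_mean> <current_mean> [max_drop]
# Exit status 0 when the score held, 1 when it dropped by more than max_drop.
check_drift() {
  awk -v base="$1" -v cur="$2" -v tol="${3:-0.10}" 'BEGIN {
    drop = base - cur
    if (drop > tol) {
      printf "ALERT: score fell %.3f (%.3f -> %.3f)\n", drop, base, cur
      exit 1
    }
    printf "ok: drop=%.3f within tolerance %.3f\n", drop, tol
  }'
}

# Made-up weekly means, for illustration only.
check_drift 1.000 0.967 || echo "investigate before the next release"
check_drift 1.000 0.700 || echo "investigate before the next release"
```

A fixed tolerance is the simplest possible rule. At n=30 a single cell mean moves by 0.033 per dispatch, so a tolerance much tighter than 0.10 will fire on noise.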

How to reproduce this on your data

Full disclosure: the 150 receipts behind this post were fired through our Pro orchestrator (ato evaluations methodology run) because that’s what we use ourselves. You don’t need it. The capability is free: your API keys pay, your machine runs the dispatches, and there is no cloud roundtrip. The Pro tier just collapses the by-hand recipe below into one command plus ledger plus schedule. Here are both paths, so you can choose.

The free way: by hand with `ato dispatch`

The orchestration in this post (the fan-out, the per-cell stats, the cost ledger) is the Pro tier, but ato dispatch ships in the OSS binary and gives you everything you need to reproduce the study. It takes a few more lines of bash and produces the same kind of numbers at the end. Here’s the recipe.

# Install the free OSS binary
brew install willnigri/ato/ato

# Pick your prompts (an example five-prompt security-review set; swap in
# your own, or the study's P1-P5 listed earlier in this post)
PROMPTS=(
  "Review src/auth.ts for SQL injection vulnerabilities."
  "Audit the password reset flow in src/auth.ts for timing attacks."
  "Find authorization bypasses in src/auth.ts."
  "Identify hard-coded secrets risk in src/auth.ts."
  "Review src/auth.ts session-token handling for CSRF risk."
)

# Pick your models
MODELS=("claude-sonnet-4-6" "gemini-2.5-flash")

# Pick your N (industry baseline is n=30 per cell)
N=30

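# Fire the matrix yourself. Each call is one cell observation; ato
# writes a receipt to ~/.ato/local.db that you can query later.
# NOTE: the first positional argument ("claude") is kept from the original
# recipe; check the ato dispatch documentation for the value that fits
# each model's provider.
mkdir -p ./receipts
for model in "${MODELS[@]}"; do
  for p in "${!PROMPTS[@]}"; do
    for i in $(seq 1 "$N"); do
      # Encode model, prompt index, and rep in the filename so every
      # receipt is unique and its cell can be recovered later.
      ato dispatch claude "${PROMPTS[$p]}" \
        --model "$model" \
        --quiet \
        --output-format json \
        > "./receipts/${model}__p$((p + 1))__${i}.json"
    done
  done
done

# Aggregate per cell (model x prompt). Each receipt has cost_usd_estimated,
# tokens_out, response. Score the response with your own grep / jq / regex.
for cell in ./receipts/*.json; do
  # Skip empty or malformed receipts left behind by failed dispatches.
  jq -e . "$cell" > /dev/null 2>&1 || continue
  prompt=$(basename "$cell" .json | awk -F'__' '{ print $2 }')
  model=$(jq -r '.model' "$cell")
  cost=$(jq -r '.cost_usd_estimated' "$cell")
  # grep -c counts matching lines; any count above zero scores 1.
  matched=$(jq -r '.response' "$cell" | grep -ciE "(sql injection|vulnerability|sanitize)")
  echo "$model,$prompt,$cost,$([ "$matched" -gt 0 ] && echo 1 || echo 0)"
done | awk -F',' '
  { k = $1 " " $2; sum[k] += $4; cost[k] += $3; n[k]++ }
  END { for (k in n) printf "%-30s mean=%.3f  total_cost=$%.4f  n=%d\n",
                   k, sum[k]/n[k], cost[k], n[k] }'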

That’s ~30 lines of bash plus two standard tools (jq and awk). Run it overnight. You’ll get per-cell cost and score figures of the same kind we got, scored against your own regex, with the same n=30 per cell. No subscription, no cloud, no Pro tier. The free building blocks are intentionally enough to do this.

What you don’t get for free: resumable state when 7 of your 30 dispatches fail mid-loop, the structured methodology config so you can re-run the same study next month, the per-cell Welch t with 95% CIs computed for you, the cost-ledger view that splits customer-spend from infra cost, the rubric library that pre-packages LLM-as-judge in addition to regex, the schedule-create that re-fires this weekly to catch model drift, the diagnose pipeline that reads your worst cells and proposes an agent change. That’s the line.

The Pro way: one command

Everything in the by-hand recipe above (orchestrated, resumable, scored, ledger’d, schedulable) collapses to one config file and five short steps.

brew install willnigri/ato/ato
ato pro install                                  # downloads the Pro binary

# 1. Write your methodology config: 5 prompts × 2 models × n=30
cat > my-test.json <<'EOF'
{
  "slug": "cost-quality-test-2026-05",
  "archetype": "model-ladder",
  "variant_matrix": {
    "prompts": [
      "your prompt 1",
      "your prompt 2",
      "your prompt 3",
      "your prompt 4",
      "your prompt 5"
    ],
    "models": ["claude-sonnet-4-6", "gemini-2.5-flash"],
    "conditions": ["default"],
    "reps_per_cell": 30
  },
  "rubric": {
    "kind": "regex",
    "pattern": "(?i)(YOUR_EXPECTED_KEYWORD_PATTERN)",
    "case_insensitive": true
  }
}
EOF

# 2. Preview cost (free — pure math)
ato evaluations methodology create --config my-test.json
ato evaluations methodology cost-estimate cost-quality-test-2026-05 --human

# 3. Fan it out (Pro — the codified orchestrator)
ato evaluations methodology run cost-quality-test-2026-05 --human

# 4. Read the per-cell composition (Pro — automated stats + Welch t)
ato evaluations methodology runs show <run-id> --human

# 5. Schedule it weekly so you catch model drift (Pro — codified cron)
ato evaluations methodology schedule create weekly-drift-watch \
  --methodology cost-quality-test-2026-05 --cron "0 9 * * MON"

The Pareto frontier prints to your terminal. If gemini wins on your data too, you have your answer.

What you get free vs. what ATO Pro adds

The principle: free = the building blocks you can script with; Pro = the codified one-button automations on top of those blocks. If you want to do everything by hand, you can. If you want us to do it for you, that’s where you pay.

Free. ato dispatch, war-rooms, sessions, replay, the SQLite receipt log, the agent file format, the rubric library (regex / structural / LLM-as-judge), the math (Welch t, normal-CDF approximation, per-cell stats), the methodology config format, viewing past runs. Everything you need to reproduce a study like this one by writing the 30-line bash loop above. Your keys, your machine, your data.

ATO Pro ($29/mo). The codified one-button versions of those same workflows: methodology run (orchestrator that fans the variant matrix out for you, resumable, ledger’d); methodology adopt (ingest existing dispatches into a structured run); methodology score (run a rubric against any past dispatch); methodology margin (cost-ledger view splitting your spend from infra); methodology diagnose (an LLM reads your failing cells and proposes a specific change to your agent definition); --apply with lineage tracking, the holdout-set A/B; methodology schedule create for scheduled re-runs that catch model drift before you do; auto-revert when production telemetry says a change regressed. The Pro tier is the harness; the free tier is the parts to build one yourself if you’d rather. Learn about Pro →

Honest disclosure

Sample sizes weren’t identical. Two of the gemini cells came in at n=27-29 instead of n=30 because of API timeouts under sustained load. The errored dispatches are still in our cost ledger; the score-cell-mean is computed over only the successful ones.
The rubric is intentionally simple. A regex match on security keywords isn’t a serious security assessment; it’s a coverage probe. More demanding rubrics (structural assertions about response shape, LLM-as-judge scoring) may split the models differently. The methodology score command supports all three rubric kinds.
One snapshot. The findings are valid for these specific models on 2026-05-25. Both gemini-flash and sonnet will change with future releases. Re-run before making procurement decisions on numbers older than a month.
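The sample-size disclosure above describes a specific accounting rule: errored dispatches stay in the cost total but are left out of the score mean. A minimal sketch of that rule, assuming a hypothetical "status,cost_usd,score" export with one row per dispatch (not ATO's actual receipt schema):

```shell
#!/usr/bin/env bash
# Reads "status,cost_usd,score" rows on stdin (a hypothetical export format).
# Cost sums every dispatch; the mean score uses only rows with status "ok".
cell_summary() {
  awk -F',' '
    { cost += $2 }
    $1 == "ok" { sum += $3; n++ }
    END {
      if (n == 0) { printf "n=0 mean=NA total_cost=%.4f\n", cost; exit }
      printf "n=%d mean=%.3f total_cost=%.4f\n", n, sum / n, cost
    }'
}

# Made-up rows, for illustration only: two successes and one timeout.
printf '%s\n' "ok,0.0042,1" "ok,0.0042,1" "timeout,0.0000,0" | cell_summary
```

This is why a cell can report n=27 while the ledger still shows 30 dispatches: n counts scored responses, not attempts.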


Real data behind this post: 150 fresh gemini-2.5-flash dispatches fired 2026-05-25 between 00:25 and 01:49 UTC against the same five Part 5 prompts. Customer-side spend: $0.56. Receipts in ~/.ato/local.db on the author’s machine. The claude-sonnet-4-6 data came from the v2.9 build log (Part 5) which lives in the same database. We ran these with Pro’s methodology run orchestrator because it’s our own work; the same 150 dispatches can be reproduced by hand with the free ato dispatch primitive and a short bash loop. That’s the boundary we drew.