// compare any AI · keep the receipts ↓
Run any AI on your actual task — see which one solved it cheapest and best, with receipts. One command across Claude, Codex, Gemini, Grok, MiniMax, or any of 20+ supported runtimes. They run the same prompt in one shared session, call real tools (read_file, grep, git_log) to verify claims in your repo, and produce a signed audit trail with cost, tokens, and tool-call receipts. Drive it from a GUI, a CLI, or your coding agent over MCP — same data, same audit trail. Local-first. MIT. Bring your own keys.
You paste the same question into Claude, GPT, and Gemini, one tab at a time. Each starts from zero. None of them sees what the others said. The disagreement that should be the signal is buried in your clipboard history.
Most multi-LLM debate tools can’t read your repo, can’t grep, can’t verify a single claim before stitching the answers together. They’re vibes-as-a-service — clever, but unverifiable.
You get an answer, you read it, you move on. No record of which LLM made which claim, no way to cite “confirmed by GPT, disputed by Claude,” no markdown you can paste into a PR. The receipt is the artifact, and it’s missing.
Every multi-LLM dispatch lands in your local SQLite as a session you can scroll through later. Each row carries an auto-generated summary, the runtimes that spoke, the personas (when you used `--agent`), tags, and a session id you can pass to `ato sessions get` from your terminal. No accounts, no cloud round-trip — everything stays on your machine.
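For example, pulling a stored session back up (the id is a placeholder; use one from your own history, and note the output shape shown in the comment is illustrative):

```sh
# Fetch a stored session by id. Prints the stored row described above:
# summary, runtimes, personas, tags. (Exact output format may differ.)
ato sessions get <session-id>
```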
```sh
ato review --reviewer @security-specialist --reviewer @perf-reviewer --reviewer claude --reviewer minimax
```

- Function-calling tools (`read_file`, `grep`, `git_log`)
- Persistent specialist agents with system prompts
- Per-turn audit trail in the GUI — verified-via-N-tool-calls vs prompt-only badges
- Lean mode forces the LLMs to walk the live repo
Switch `run_agent.@reviewer` from Sonnet 4.6 to Opus 4.7 and the dashboard flags "success rate dropped 17pp across 412 conversations." Joins the configuration-change ledger with trace windows automatically. Severity-tagged: regressions first, improvements second, neutral hidden by default.

Reference `{user_name}`, `{project_root}`, `{recent_orders}` in your system prompt. Resolvers: static, env, project path, file, database query, MCP call, computed JS (a hypothetical config shape is sketched below).

Pick any past trace. Click Replay. Re-run the original prompt against a different runtime. See source vs replay side-by-side with duration + estimated cost delta. Would Codex have answered correctly on those failing prompts? Now you can find out.
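The resolver list above maps naturally onto a small config. A minimal sketch, assuming a YAML file; the layout and key names are mine, only the resolver kinds come from the feature description:

```yaml
# Hypothetical resolver config: file shape and keys are assumptions.
# Only the resolver kinds (static, env, project path, file, database
# query, MCP call, computed JS) come from the feature description above.
variables:
  user_name:
    resolver: env
    key: USER
  project_root:
    resolver: project_path
  recent_orders:
    resolver: database_query
    query: "SELECT id, total FROM orders ORDER BY created_at DESC LIMIT 5"
```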
Replays run through `prompt_agent_inner`, so a replay is itself killable and appears in Live runs. Status pill ticks pending → running → done; result panel renders both responses + duration delta. Source prompts come from your local execution log — ATO never sends prompt content to a server you don't already use.

`@code-writer` · claude → codex · −59% per call · projected $1.01/mo at this volume. Surfaces concrete swaps when you have multi-runtime history on the same agent and the alternative is meaningfully cheaper at preserved quality. Quality guards: ≥30% cheaper, ok-rate within 10pp, eval-score within 5pp (sketched in code below). Renders nothing if no rec qualifies — better than fake confidence.

Pipeline stages are linked by `parent_run_id`. One row per pipeline; click into the per-stage flow with handoff arrows + per-stage timing + files touched per stage.

Per-runtime context breakdown. Switch between Claude, Codex, OpenClaw, and Hermes to see what each agent has loaded. Skills shown as on-demand — not counted in the total.
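The cost-recommendation quality guards above reduce to a small predicate. A minimal sketch (type and function names are assumptions, not ATO internals; the thresholds are the documented ones):

```ts
// Illustrative guard logic mirroring the documented thresholds.
// RuntimeStats and qualifies() are hypothetical names, not ATO's API.
type RuntimeStats = { costPerCall: number; okRate: number; evalScore: number };

function qualifies(current: RuntimeStats, alt: RuntimeStats): boolean {
  const cheaperEnough = alt.costPerCall <= 0.7 * current.costPerCall; // at least 30% cheaper
  const okRateHolds = current.okRate - alt.okRate <= 0.10;            // ok-rate within 10pp
  const evalHolds = current.evalScore - alt.evalScore <= 0.05;        // eval-score within 5pp
  return cheaperEnough && okRateHolds && evalHolds;                   // otherwise: render nothing
}
```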
Manage skills across all runtimes with per-runtime tabs. Browse the marketplace, install community skills, or ask AI to create one for you.
Visual workflow editor that auto-detects flows from your installed skills. Any skill with Step or Phase headers becomes a visual automation.
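For instance, a skill file like this (a hypothetical skill, invented for illustration; the detector keys off the `Step`/`Phase` headers) would surface as a three-step flow:

```markdown
# deploy-preview (hypothetical skill)

## Step 1: Build
Run the project build; stop on the first error.

## Step 2: Smoke test
Hit the preview URL and confirm a healthy response.

## Step 3: Report
Post the preview link back to the pull request.
```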
Pick an agent (or a routed/sequential group) and a schedule. The agent’s system prompt, variables, hooks, memory, and skills all fire on every run — not just a raw prompt.
`systemd --user` timers on Linux, Task Scheduler on Windows. Jobs fire even when ATO is closed (see the inspection command below).

Centralized dashboard to store, rotate, and scope API keys for every major LLM provider. Keys are encrypted locally — never sent to any server.
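For the scheduler above, systemd's own tooling can confirm the Linux side; the `ato-*` unit-name pattern is an assumption (check what unit names ATO actually registers):

```sh
# List user-level timers; the name pattern is a guess at ATO's unit names.
systemctl --user list-timers 'ato-*'
```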
Live dashboard showing active agent sessions, token consumption rates, runtime health, and smart alerts — across all your AI coding tools at once.
Complete audit trail of every action across your agentic systems. Filter by action type, resource, and time range. Export to JSON for compliance.
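Once exported, the JSON slices cleanly with standard tools. A sketch assuming the export is a top-level array; the `action` field and `tool_call` value are assumptions about the export shape:

```sh
# Filter an exported audit log down to tool-call actions with jq.
# ("action" / "tool_call" are assumed field names, not documented ones.)
jq '[.[] | select(.action == "tool_call")]' audit-export.json
```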
Connect your company's identity provider. Google Workspace, Okta, Microsoft Entra, or any OIDC provider — with domain restriction and auto-provisioning.
Every ATO agent is exposed as an MCP tool. Any MCP-aware runtime — Claude Code, Codex, Cursor, and others — can dispatch to any ATO agent regardless of which runtime owns it.
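On the wire, a dispatch is a standard MCP `tools/call` request. The JSON-RPC envelope below follows the MCP spec; the tool name reuses the `@security-specialist` agent from earlier, and the `prompt` argument shape is an assumption:

```jsonc
// Hypothetical dispatch: envelope per the MCP spec, argument shape assumed.
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "security-specialist",
    "arguments": { "prompt": "Audit the auth middleware before merge." }
  }
}
```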
Free, open source, and ready for your platform.
> Early access: every feature free with a cloud sign-up — replay, compare, regression detection, cost recommendations, cloud sync, trace retention, evaluators. No payment, no credit card — just an email.
Complementary, not competing. ATO is your local war room for humans and LLMs — the developer side of multi-runtime AI work. For SDK-based production observability across your deployed app stack, use Langfuse, Helicone, or LangSmith. Most production teams run one from each camp — they cover different sides of the same agent. More on how they fit together →