Analysis · June 9, 2026 · 10 min read

Codex vs. Claude Code vs. Cursor: An Interactive Comparison

Pick the agents, read the data — performance, cost, domains, and failure modes.

Leaderboards collapse an agent into a single number. But the same benchmark holds far more: how much each agent costs, how long it takes, which domains it is strong in, and how it fails when it fails. This post is interactive: everything below is loaded live from our results database, and you choose which agents to compare.

We start with three mainstream agent/model pairings: Codex (GPT-5.5), Claude Code (Fable 5), and Cursor (Composer 2.5). You can add any of the other agents we ran with the + Add model button.

One metric note that matters: every average below uses, as its denominator, the number of tasks that agent actually has a valid record for, not a fixed 150. Runs that crashed in the environment, or that the grader could not score, are dropped before averaging. So an agent that attempted fewer tasks is not penalized for coverage here; we are comparing quality on what each one ran.
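As a rough illustration of that denominator rule, here is a minimal Python sketch. The record shape `(agent, task_id, score, valid)` and the function name are hypothetical, not the schema of our results database. It drops invalid runs, collapses repeated runs of a task to their mean, and divides by the number of tasks the agent has a valid record for.

```python
from collections import defaultdict
from statistics import mean


def macro_mean_scores(records):
    """Return {agent: (mean_score, n_valid_tasks)}.

    `records` is an iterable of (agent, task_id, score, valid) tuples.
    This tuple layout is an assumption made for the example.
    """
    per_task = defaultdict(lambda: defaultdict(list))
    for agent, task_id, score, valid in records:
        if not valid:
            # Crashed or ungradable runs are dropped before averaging.
            continue
        per_task[agent][task_id].append(score)

    result = {}
    for agent, tasks in per_task.items():
        # Collapse multiple runs of one task to a single per-task mean.
        task_means = [mean(scores) for scores in tasks.values()]
        # Denominator is the agent's own valid-task count, not a fixed total.
        result[agent] = (mean(task_means), len(task_means))
    return result


# Made-up data, for illustration only.
example_records = [
    ("agent_a", "t1", 1.0, True),
    ("agent_a", "t1", 0.0, True),   # second run of the same task
    ("agent_a", "t2", 1.0, True),
    ("agent_a", "t3", 0.0, False),  # environment crash: excluded
    ("agent_b", "t1", 0.5, True),
]
print(macro_mean_scores(example_records))
```

In this toy data, `agent_a` is averaged over two tasks rather than three, because its third task has no valid record.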


1. Performance vs. cost and time

The two panels below plot mean score against total API cost (left) and total agent compute time (right), summed across every valid run. Compute time is agent-only (excluding queue, provisioning, and evaluation), not raw wall-clock, which on this benchmark is dominated by 5-hour timeout walls. Bubble size is total tokens consumed. Up and to the left is better: high score, low cost/time.
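The per-agent totals behind each bubble can be sketched roughly as follows. The `Run` fields are hypothetical, and deriving agent-only time by subtracting queue, provisioning, and evaluation time from wall-clock is an assumption for the example, not a description of how our harness records it.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # All field names are hypothetical, chosen for this example.
    agent: str
    cost_usd: float
    tokens: int
    wall_s: float
    queue_s: float
    provision_s: float
    eval_s: float


def agent_seconds(run):
    """Agent-only compute time: wall-clock minus non-agent phases (assumed)."""
    return run.wall_s - run.queue_s - run.provision_s - run.eval_s


def chart_totals(runs):
    """Sum cost, agent-only hours, and tokens per agent over valid runs."""
    totals = {}
    for r in runs:
        t = totals.setdefault(
            r.agent, {"cost_usd": 0.0, "agent_hours": 0.0, "tokens": 0}
        )
        t["cost_usd"] += r.cost_usd
        t["agent_hours"] += agent_seconds(r) / 3600
        t["tokens"] += r.tokens
    return totals


# Made-up runs, for illustration only.
example_runs = [
    Run("agent_a", 1.5, 1000, 9000, 900, 600, 300),
    Run("agent_a", 0.5, 500, 3600, 0, 0, 0),
]
print(chart_totals(example_runs))
```

Each resulting entry maps to one bubble: `cost_usd` or `agent_hours` on the x-axis, and `tokens` as the bubble size.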

2. Where each agent is strong

The benchmark spans 14 top-level domains. Aggregate scores hide large per-domain swings: an agent can lead overall while trailing badly in a specific field. Hover any bar for the task count behind it.

3. How they fail

A score gap tells you that an agent failed, not why. We read the run trajectory of every non-passing run for the three featured agents and tagged each against a unified failure taxonomy of four top-level groups (Understanding, Approach, Execution, Infrastructure) split into eight classes, each with a fine-grained reason. A run can carry several tags. Click a class to see the fine-grained reasons and concrete examples.

The headline: every agent over-claims, but to different degrees. Codex (GPT-5.5) leans on it hardest: about 73% of its failing runs are tagged hallucination / fabrication, almost always “assumed-unverified-success”, meaning a confident “done, all checks pass” over a result the grader scores 0. Composer 2.5 (~57%) and Fable 5 (~55%) follow. Fable differs by failing more structurally: it carries the highest share of output-format errors and timeouts. Same benchmark, different failure signatures.

Each line is one agent; each point is the share of that agent's non-passing runs tagged with a failure class.

A run can be tagged with several classes, so an agent's points do not add up to 100%. Click any point for the fine-grained reasons and examples.

Legend: Codex (GPT-5.5) · Claude Code (Fable 5) · Cursor (Composer 2.5)
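To make the "does not add up to 100%" point concrete, here is a minimal sketch of how a per-class share can be computed when runs carry several tags. The function name and the toy data are assumptions. The class labels are borrowed from the text above, and the numbers are invented, not our results.

```python
from collections import Counter


def failure_class_shares(failing_runs):
    """Share of non-passing runs tagged with each failure class.

    `failing_runs` is a list with one collection of class tags per
    non-passing run (a hypothetical shape for this example).
    """
    n = len(failing_runs)
    if n == 0:
        return {}
    # set() ensures a run counts at most once per class.
    counts = Counter(tag for tags in failing_runs for tag in set(tags))
    return {tag: count / n for tag, count in counts.items()}


# Invented data: four failing runs, some with more than one tag.
example_failing_runs = [
    {"hallucination", "output-format"},
    {"hallucination"},
    {"timeout"},
    {"hallucination", "timeout"},
]
shares = failure_class_shares(example_failing_runs)
print(shares)
print(sum(shares.values()))  # exceeds 1.0 because tags overlap
```

Because the denominator is the number of failing runs rather than the number of tags, the shares for one agent can sum to more than 1.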

4. The tasks where they disagree

Aggregates aside, the sharpest signal is a single task where strong agents split. Below is every task on which the three featured agents disagree (at least one passed and at least one failed), grouped by domain. Tasks where all three pass or all three fail are left out. Each card shows the per-agent pass/fail lights; open it for a detailed, trajectory-grounded account of exactly where each failing agent went wrong.

28 tasks where the three featured agents disagree (at least one pass and one fail). Tasks all three pass or all three fail are omitted. Click any card for the per-agent failure analysis.

🏥 Health & Medicine · 6

🧬 Life Sciences · 5

💻 Computing & Mathematics · 5

🔬 Physical Sciences · 4

💼 Business & Finance · 4

🎨 Visual & Media Arts · 2

📚 Education & Information · 1

🌍 Social Sciences · 1
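The disagreement filter behind the list above can be sketched as follows. The input shape `{task_id: (domain, {agent: passed})}` is hypothetical, and the sketch assumes every featured agent has a valid pass/fail record for each task.

```python
from collections import defaultdict


def disagreements_by_domain(tasks):
    """Group tasks where at least one agent passed and at least one failed.

    `tasks` maps task_id -> (domain, {agent: passed_bool}); this layout
    is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for task_id, (domain, passed_by_agent) in tasks.items():
        outcomes = set(passed_by_agent.values())
        # All-pass gives {True}, all-fail gives {False}; both are skipped.
        if outcomes == {True, False}:
            grouped[domain].append(task_id)
    return dict(grouped)


# Invented tasks, for illustration only.
example_tasks = {
    "t1": ("Life Sciences", {"a": True, "b": False, "c": True}),
    "t2": ("Life Sciences", {"a": True, "b": True, "c": True}),
    "t3": ("Physical Sciences", {"a": False, "b": False, "c": False}),
    "t4": ("Physical Sciences", {"a": False, "b": True, "c": False}),
}
print(disagreements_by_domain(example_tasks))
```

Counting the task ids under each domain gives the per-domain numbers shown next to each heading.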

Methodology

  • Data is aggregated live from the experiment_logs table. A valid record excludes environment/harness crashes and runs the grader could not score.
  • Averages are macro (per-task equal weight); a task with multiple runs is collapsed to its mean before averaging, and the denominator is the agent's own valid-task count.
  • Cost comes from logged per-run cost where available; for harnesses that do not log cost (e.g. Codex), we fall back to an authoritative token-pricing snapshot, flagged in the chart.
  • Failure tags come from an LLM reading each run's trajectory summary (not the full transcript) and classifying it against the shared taxonomy.
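The cost fallback described above can be sketched like this. The function name, its arguments, and the per-million-token prices are placeholders for the example. They are not real prices and not the actual snapshot we use.

```python
def run_cost(logged_cost_usd, input_tokens, output_tokens, price_per_mtok):
    """Return (cost_usd, is_estimated) for one run.

    `price_per_mtok` is an (input_price, output_price) pair in USD per
    million tokens, taken from a pricing snapshot (placeholder values here).
    """
    if logged_cost_usd is not None:
        # The harness logged a cost: use it as-is.
        return logged_cost_usd, False
    in_price, out_price = price_per_mtok
    estimate = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    # The flag lets the chart mark this point as estimated.
    return estimate, True


# Placeholder prices, for illustration only.
snapshot = (1.0, 4.0)
print(run_cost(0.42, 2_000_000, 1_000_000, snapshot))  # logged cost wins
print(run_cost(None, 2_000_000, 1_000_000, snapshot))  # falls back to tokens
```

Returning the flag alongside the number keeps logged and estimated costs distinguishable after they are summed per agent.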