Analysis · Published June 9, 2026 · Last updated June 11, 2026 · 10 min read

Codex (GPT-5.5) vs. Claude Code (Fable 5) vs. Cursor (Composer 2.5): An Interactive Comparison

Pick the agents, read the data: performance, cost, domains, and failure modes.

Leaderboards collapse an agent into a single number. But the same benchmark holds far more: how much each agent costs, how long it takes, which domains it is strong in, and how it fails when it fails. This post is interactive: everything below is loaded live from our results database, and you choose which agents to compare.

We start with three mainstream agent-model pairings: Codex (GPT-5.5), Claude Code (Fable 5), and Cursor (Composer 2.5). You can add any of the other agents we ran with the "+ Add model" button.


1. Performance, cost, and speed

The two panels below plot pass rate (the share of tasks fully solved, i.e. runs scoring 100%) against total API cost (left) and total agent runtime (right). Both resource metrics are benchmark-wide totals: for each task we average the metric over its valid runs, then sum those per-task averages. Runtime is measured from agent start to agent finish, matching the public leaderboard's Total Runtime; queueing, environment setup, evaluation, and output sync are excluded. Use the toggle to switch the y-axis to mean score, or show both at once: solid bubbles are pass rate, striped ones are mean score. Bubble size is total tokens consumed; hover over a bubble to see every metric. Up and to the left is better: a higher pass rate or score at a lower cost or runtime.
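The aggregation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the benchmark's actual pipeline: the `Run` record, its field names, and the `valid` flag are hypothetical, and pass rate is computed here as the per-task share of valid runs scoring 100%, averaged over tasks (with one valid run per task this reduces to the share of tasks fully solved).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    # Hypothetical record layout; field names are assumptions for illustration.
    task_id: str
    score: float        # 0.0 to 1.0, where 1.0 means fully solved
    cost_usd: float     # API cost of this run
    runtime_s: float    # agent start to agent finish, in seconds
    valid: bool = True  # invalid runs are excluded from every metric


def summarize(runs):
    """Aggregate one agent's runs into the four plotted metrics."""
    by_task = defaultdict(list)
    for run in runs:
        if run.valid:
            by_task[run.task_id].append(run)
    if not by_task:
        raise ValueError("no valid runs")
    per_task = list(by_task.values())
    return {
        # Average over tasks of the share of valid runs that scored 100%.
        "pass_rate": mean(
            mean(1.0 if r.score == 1.0 else 0.0 for r in rs) for rs in per_task
        ),
        "mean_score": mean(mean(r.score for r in rs) for rs in per_task),
        # Resource totals: per-task average over valid runs, summed over tasks.
        "total_cost_usd": sum(mean(r.cost_usd for r in rs) for rs in per_task),
        "total_runtime_s": sum(mean(r.runtime_s for r in rs) for rs in per_task),
    }


example_runs = [
    Run("a", 1.0, 2.0, 600.0),
    Run("a", 0.5, 4.0, 1200.0),
    Run("b", 1.0, 1.0, 300.0),
    Run("c", 0.0, 9.0, 50.0, valid=False),  # ignored: not a valid run
]
summary = summarize(example_runs)
print(summary)
```

In the toy data, task "a" contributes its average cost (3.0) rather than the sum of its two runs, so an agent is not penalized in the totals for having been run more often on one task.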

Every run is also subject to a hard 5-hour wall-clock cap: an agent that is still working after 5 hours is terminated and graded on whatever it has produced by that point. The timeout rate below the charts is the share of an agent's runs that hit this cap. It is a complementary signal to raw speed, since an agent can have a low average time yet still stall out on hard tasks.
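A short sketch of the timeout rate, and of why it complements average runtime. It assumes a timed-out run can be identified by its wall-clock duration reaching the cap; the real harness may record an explicit flag instead, and the function name and toy durations are hypothetical.

```python
from statistics import mean

TIMEOUT_CAP_S = 5 * 60 * 60  # the 5-hour wall-clock cap, in seconds


def timeout_rate(wall_clock_seconds, cap_s=TIMEOUT_CAP_S):
    """Share of runs whose wall-clock duration reached the cap."""
    durations = list(wall_clock_seconds)
    if not durations:
        raise ValueError("no runs")
    # Assumption: a run that hit the cap reports a duration >= cap_s.
    return sum(d >= cap_s for d in durations) / len(durations)


# Toy data: three quick runs and one that ran into the cap.
durations_s = [600, 900, 18000, 1200]
avg_s = mean(durations_s)
rate = timeout_rate(durations_s)
print(f"average runtime: {avg_s / 3600:.2f} h, timeout rate: {rate:.0%}")
```

Here the average stays under an hour and a half even though a quarter of the runs were terminated at the cap, which is the situation the timeout rate is meant to expose.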

2. Strengths by domain

The benchmark spans 13 top-level domains. Aggregate scores hide large per-domain swings: an agent can lead overall while trailing badly in a specific field.

3. Sample tasks where agents perform differently

The most revealing tasks are the ones where strong agents split: at least one passes and at least one fails. Below, grouped by domain, is every such task for the three featured agents. Each card shows who passed and who failed; click it to read, based on the actual run trajectories, exactly where each failing agent went wrong.
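The selection rule (at least one featured agent passes and at least one fails) is simple to state in code. The sketch below assumes a hypothetical mapping from task ID to each agent's pass/fail outcome; the task IDs and outcomes are made up for illustration and are not benchmark results.

```python
def split_tasks(results):
    """Return the tasks where agents disagree, with who passed and who failed.

    `results` maps task_id -> {agent_name: passed}, a hypothetical layout.
    """
    selected = {}
    for task_id, by_agent in results.items():
        # A task qualifies only if both outcomes occur among the agents.
        if set(by_agent.values()) == {True, False}:
            selected[task_id] = {
                "passed": sorted(a for a, ok in by_agent.items() if ok),
                "failed": sorted(a for a, ok in by_agent.items() if not ok),
            }
    return selected


toy_results = {
    "task-1": {"Codex (GPT-5.5)": True, "Claude Code (Fable 5)": False, "Cursor (Composer 2.5)": False},
    "task-2": {"Codex (GPT-5.5)": True, "Claude Code (Fable 5)": True, "Cursor (Composer 2.5)": True},
    "task-3": {"Codex (GPT-5.5)": False, "Claude Code (Fable 5)": False, "Cursor (Composer 2.5)": False},
}
splits_found = split_tasks(toy_results)
print(splits_found)
```

Tasks that everyone passes or everyone fails drop out, since they say nothing about how the agents differ.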

🏥 Health & Medicine · 5

🧬 Life Sciences · 4

💻 Computing & Mathematics · 5

🔬 Physical Sciences · 4

💼 Business & Finance · 4

🎨 Visual & Media Arts · 2

📚 Education & Information · 1

🌍 Social Sciences · 1

4. Fable 5's safety fallback

Updated Jun 11

TL;DR

On ~35% of tasks, Fable 5's request was refused upstream and Claude Code silently switched the run to Opus 4.8 mid-task — almost entirely benign life-sciences, health, and physical-science work flagged as "cybersecurity or biology." The scores below therefore aren't pure Fable 5: on the untouched tasks Fable 5 matches Codex (GPT-5.5) and beats Opus 4.8, but on the flagged tasks the forced switch drags it down to Opus-4.8 level — a ~6-point pass-rate haircut traceable to the safety fallback, not the model.

Reading the raw Claude Code transcripts surfaced something the leaderboard number hides. On a subset of runs, Fable 5's request was refused upstream and the harness silently retried the same turn on Opus 4.8 mid-task. Claude Code records it as a structured system event, not as model text:

{"type":"system","subtype":"model_refusal_fallback","trigger":"refusal",
 "direction":"retry","original_model":"anthropic/claude-fable-5",
 "fallback_model":"claude-opus-4-8",
 "content":"Fable 5's safety measures flagged this message for cybersecurity or biology topics ... Switched to Opus 4.8."}

Grepping that signature — model_refusal_fallback — across the Fable 5 transcripts, the fallback fired on ~35% of the benchmark tasks. Each carried the same "cybersecurity or biology topics" message and resolved Fable 5 → Opus 4.8. The flagged set skews to genomics, clinical, molecular, and reverse-engineering work — many apparently benign scientific tasks the filter caught alongside anything genuinely sensitive. Anthropic documents this behavior in the Fable 5 model card (refusals, fallback & billing).
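The search described above can be approximated with a structured scan rather than a plain text grep. This is a sketch under assumptions: it supposes each task's transcript is a JSON Lines file (one event per line) named after the task in a single directory, which the post does not state. Only the `type` and `subtype` values come from the event shown above; the function names and file layout are hypothetical.

```python
import json
from pathlib import Path

FALLBACK_SUBTYPE = "model_refusal_fallback"


def has_fallback(lines):
    """True if any line is a system event with the fallback subtype."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if (
            isinstance(event, dict)
            and event.get("type") == "system"
            and event.get("subtype") == FALLBACK_SUBTYPE
        ):
            return True
    return False


def fallback_share(transcript_dir):
    """Return (flagged task names, share of transcripts with a fallback).

    Assumes one `<task>.jsonl` transcript per task in `transcript_dir`.
    """
    paths = sorted(Path(transcript_dir).glob("*.jsonl"))
    if not paths:
        raise ValueError("no transcripts found")
    flagged = []
    for path in paths:
        with path.open(encoding="utf-8") as fh:
            if has_fallback(fh):
                flagged.append(path.stem)
    return flagged, len(flagged) / len(paths)


sample = [
    '{"type":"assistant","text":"working on it"}',
    '{"type":"system","subtype":"model_refusal_fallback","trigger":"refusal"}',
]
print(has_fallback(sample))
```

Matching on the parsed `subtype` field, rather than on the raw string, avoids counting a transcript in which the phrase only appears inside model text or tool output.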

[Chart: Where the fallback fired, across the 7 domains with at least one flagged task. Legend: fell back to Opus 4.8; ran on Fable 5.]

The fallback is overwhelmingly a life-sciences / health / physical-sciences phenomenon — every life-sciences task (19/19) was flagged — while finance, visual media, transport, legal, and agriculture were never touched. This effectively splits the benchmark in two. On the affected tasks, Fable 5 is really a Fable 5 → Opus hybrid, so it tracks Opus 4.8 almost exactly. On the unaffected tasks — pure Fable 5, run end to end — it pulls ahead of Opus and lands much closer to Codex (GPT-5.5):

Split	Tasks	Fable 5	Opus 4.8	GPT-5.5
Unaffected (pure Fable 5)	101	22.8%	15.8%	23.8%
Affected (Fable 5 → Opus hybrid)	51	17.6%	15.7%	17.6%

On affected tasks, the Fable 5 column is not pure Fable 5; it is the post-switch Fable 5 → Opus 4.8 hybrid, which tracks Opus closely. On unaffected tasks, where Fable 5 runs end to end, it looks much closer to GPT-5.5. The leaderboard score should therefore be read as a mixed-system result, not a clean estimate of standalone Fable 5 capability.
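To see how the two splits combine into one blended number, the percentages in the table can be converted back to task counts and pooled. This is a rough worked example, not a reported result: it assumes the table's values are pass rates, that each percentage is a rounded ratio of whole tasks, and that the 152 tasks in the table are the full set being pooled.

```python
# Values copied from the table above (percentages per split).
splits = {
    "unaffected": {"tasks": 101, "Fable 5": 22.8, "Opus 4.8": 15.8, "GPT-5.5": 23.8},
    "affected": {"tasks": 51, "Fable 5": 17.6, "Opus 4.8": 15.7, "GPT-5.5": 17.6},
}


def implied_passes(rate_pct, n_tasks):
    """Back out a whole-task count from a rounded percentage (an assumption)."""
    return round(rate_pct / 100 * n_tasks)


def pooled_rate(model):
    """Rate over both splits combined, in percent."""
    passes = sum(implied_passes(s[model], s["tasks"]) for s in splits.values())
    total = sum(s["tasks"] for s in splits.values())
    return 100 * passes / total


for model in ("Fable 5", "Opus 4.8", "GPT-5.5"):
    print(f"{model}: {pooled_rate(model):.1f}%")
# Fable 5: 21.1%
# Opus 4.8: 15.8%
# GPT-5.5: 21.7%
```

Under these assumptions the implied counts are 23 and 9 for the Fable 5 column, 16 and 8 for Opus 4.8, and 24 and 9 for GPT-5.5, so the pooled figure for the Fable 5 column is a weighted mix of a pure Fable 5 segment and a post-switch hybrid segment.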

3. Sample tasks where agents perform differently

🏥 Health & Medicine · 5

Build A Pinned Clinical Variant Annotation Table

Clinical Variant Annotation

MicroDicom NIH CXR Reader Adjudication

Prostate IMRT matRad Reproduction

Scene3 Skullstrip QC

🧬 Life Sciences · 4

K562 Genome Browser SVG Export

Marrow Cell Type Annotation

Protein Function Annotation With InterProScan

TCGA BRCA Differential Expression Analysis

💻 Computing & Mathematics · 5

Branch-and-Bound ATSP Solver

Checkpoint Consolidation V2

Finite Abelian Extension Classification

Ranking Node Feature Parity Recovery

Synthetic Causal Structure Inference

🔬 Physical Sciences · 4

Egt710 Table1 SMILES Extraction

Exact Diagonalization of the J1-J2 Heisenberg Model

GLM Lake Calibration

Lenacapavir SAR Table2 Extraction

💼 Business & Finance · 4

Equity Research Summary

Metabase BI Dashboard

SaaS One-Pager Brand Refresh

Taxform 4 1

🎨 Visual & Media Arts · 2

Inkscape Cultural Poster Design

butterfly_flap_animation

📚 Education & Information · 1

MARC Remediation FOLIO Overlay

🌍 Social Sciences · 1

Atwood 2022 Measles Vaccine Coefficient Reproduction
