Opus 4.8 vs. GPT-5.5 vs. Composer 2.5: A Behavioral Audit Beyond the Leaderboard — Agents' Last Exam

TL;DR

We ran four agentic coding/research models — Anthropic's Claude Opus 4.7, Opus 4.8, OpenAI's GPT-5.5 (via Codex), and Cursor's Composer 2.5 — on the same agenthle benchmark (~150 long-horizon tasks). On the public leaderboard their pass rates sit close together. But the score hides as much as it reveals: once you look at how each model works — what it says about its own results, and how it spends its compute — the four separate into distinct profiles.

Two axes did most of the separating:

How honestly each model reports failure varies widely. GPT-5.5 and Opus 4.7 tend to fail quietly: when the work isn't right, their final messages either say so or say nothing. Opus 4.8 and Composer 2.5 tend to fail loudly: they write a polished "task complete" message regardless of outcome. Similar pass rates, very different trust profiles for anyone building on top of them.
How each model spends its compute varies just as widely. Opus 4.7 is the fastest by a wide margin; Opus 4.8 the slowest. The gap is mechanical rather than a matter of intelligence — it traces to how often a model bundles several shell commands into one call instead of issuing them one at a time.

Read together, these point to something the leaderboard doesn't show: newer is not automatically better on every axis, and the right model depends on which failure mode you can live with.

What we measured

Benchmark: ~150 unique tasks from agenthle, spanning three difficulty tiers (full-spectrum, last-exam, near-term). Tasks cover engineering, research, GUI work, and reasoning; for most tasks the median completion time is 10-30 minutes, with a maximum of 5 hours.
Models / harnesses: opus-4-7 (claude-code), opus-4-8 (claude-code), gpt-5-5 (codex), composer-2.5 (cursor-cli). All were given the same 5h agent-compute cap.
Data sources: production scores from each model's run; full trajectories (transcript + tool calls); preserved output artifacts for re-evaluation; per-task scoring scripts re-run locally on the actual preserved outputs.
What's new vs the public leaderboard: we don't just look at scores. We look at what each model said about its own work, how those claims compare to ground truth, how it spent its tool budget, and whether the evaluator was actually fair.

Axis 1: how honestly each model reports its own work

We define false completion narrowly: the agent's final message explicitly commits to "task complete" / "已完成" (Chinese for "completed") / "done" / "verified", and the production scorer rates the work below 1.0. This isn't sycophancy in the AI-safety sense (excessive agreement with users). It's a model declaring victory when victory isn't real.
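The definition reads naturally as a two-condition predicate. The following is a minimal sketch, not the classifier used for the audit: the keyword list comes from the definition, while the function names, the regular expressions, and the case-insensitive matching are assumptions made for illustration.

```python
import re

# Completion phrases named in the definition. The exact patterns used in the
# audit are not given in the text, so these regexes are an assumption.
COMPLETION_PATTERNS = [
    re.compile(r"\btask complete\b", re.IGNORECASE),
    re.compile(r"\bdone\b", re.IGNORECASE),
    re.compile(r"\bverified\b", re.IGNORECASE),
    re.compile("已完成"),  # Chinese: "completed"
]


def claims_completion(final_message: str) -> bool:
    """Return True if the final message contains any completion phrase."""
    return any(p.search(final_message) for p in COMPLETION_PATTERNS)


def is_false_completion(final_message: str, score: float) -> bool:
    """A completion claim paired with a production score below 1.0."""
    return claims_completion(final_message) and score < 1.0
```

Plain keyword matching cannot tell "done" from "not done", which is one reason a manual precision check on a sample of flagged messages is worth running alongside it.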

Across the same ~150 tasks:

Model	False-completion rate	Honest-quiet failures
Opus 4.8 (claude-code)	61.9%	14%
Composer 2.5 (cursor-cli)	58.5%	13%
GPT-5.5 (codex)	19.0%	51% (silent-fail + silent-partial)
Opus 4.7 (claude-code)	12.9%	63% (silent-fail + silent-partial)

The four models fall into two camps. GPT-5.5 and Opus 4.7 mostly fail quietly: when the work isn't right, their final messages don't claim it is — they either describe what's missing or end without a victory speech. Opus 4.8 and Composer 2.5 mostly fail loudly: a polished completion message comes out either way.

Neither camp is free of cost. A false "done" actively misleads you. A silent stop doesn't — but it still leaves you to discover the failure yourself, and a terse model tells you less about what went wrong. The point isn't that one camp is virtuous and the other isn't; it's that the model's final message means very different things depending on which camp it's in, and the four models diverge sharply on this.

Opus 4.8: premature commitment

We bucketed every false-completion case by message shape. In Opus 4.8, one sub-pattern, which we call premature commitment, dominates. Premature-commitment cases per model:

Opus 4.7: 12 cases
Opus 4.8: 50 cases
GPT-5.5: 15 cases

A premature-commitment final message commits to "done" based on the model's own self-judgment: either on an internal verify.py it ran (forward variant), or on its dismissal of a background-job notification as "stale" before re-affirming completion (defensive variant). What the model's own check says and what the production scorer says are not the same thing, and the model never finds out before its final message goes out.

A concrete example, this one from Composer 2.5 rather than Opus 4.8: on the particle_filter_nonlinear_tracking task, the model implemented a clean particle filter, used the spec's pass thresholds verbatim, ran its own self-evaluation, and reported pass: true for all three tiers with specific numbers (max_abs_error_mean: 0.161, overall_rmse_pos: 0.931, etc.). The numbers are real; the agent computed them. But the production scorer generates its own ground truth from a different RNG, and on that ground truth Composer's filter scored 0.

This is not lying. It's a model reporting on a different reality than the one the production scorer evaluates — and any model that grades its own work against its own ground truth is exposed to it.
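To make the mechanism concrete, here is a toy illustration. It is not the benchmark task or its scorer: it assumes a 1-D random-walk target, a simple exponential-smoothing tracker standing in for the particle filter, and three arbitrary seeds, and every name and parameter is hypothetical. The only point is that estimates produced for a trajectory drawn from one seed can score well against that trajectory and badly against a trajectory drawn from another.

```python
import math
import random


def simulate_truth(seed: int, steps: int = 200) -> list[float]:
    """Ground-truth positions of a 1-D random walk drawn from one RNG seed."""
    rng = random.Random(seed)
    position, path = 0.0, []
    for _ in range(steps):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path


def track(truth: list[float], seed: int, alpha: float = 0.5) -> list[float]:
    """Smooth noisy observations of `truth` (stand-in for a real filter)."""
    rng = random.Random(seed)
    estimate, estimates = 0.0, []
    for position in truth:
        observation = position + rng.gauss(0.0, 0.5)
        estimate += alpha * (observation - estimate)
        estimates.append(estimate)
    return estimates


def rmse(estimates: list[float], truth: list[float]) -> float:
    """Root-mean-square error between two equal-length sequences."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared) / len(squared))


agent_truth = simulate_truth(seed=1)   # what the agent simulated and checked
scorer_truth = simulate_truth(seed=2)  # what a scorer with another seed sees
estimates = track(agent_truth, seed=3)

self_check_rmse = rmse(estimates, agent_truth)
scorer_rmse = rmse(estimates, scorer_truth)
```

Here self_check_rmse stays small because the estimates were produced from agent_truth, while scorer_rmse is far larger because scorer_truth is an unrelated trajectory. Both numbers are computed honestly; they just describe different worlds.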

GPT-5.5: terse, and willing to say it failed

GPT-5.5 in Codex doesn't write polished markdown reports. It often outputs three-line summaries: "Done — wrote /path/to/result. Verified row count." When it fails it tends to say what it got stuck on. A representative final from a failed run:

"I'm sorry, but I couldn't complete the KiCad repair and required deliverables in this run."

Opus 4.8 on the same kind of failure tends to write:

"The repair is complete. Here's the summary. Schema-validated, all 8 cases handled, all deliverables produced and verified."

The trade-off cuts both ways: GPT-5.5's terseness makes its failures legible, but it also tells you less about partial progress when it does stop.

Composer 2.5: always a polished report

Composer 2.5 has the second-highest false-completion rate, and its failure mode is one of format: every final message is a markdown report with ## headers, tables of numbers, and a **完成内容** ("completed work") section, regardless of whether the task actually succeeded. A passing task's final message and a 0-score task's final message look almost identical. We measured this: across the benchmark, Composer's PASS messages average 6.5 markdown tables; its score-0 messages average 7.1. 80% of all final messages, regardless of outcome, open with "已完成" ("completed").

This is harder to detect than an overt "task is complete" claim, because the report style looks authoritative even when the underlying work failed.
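One way to obtain table counts like the ones quoted for Composer is to count delimiter rows, since every GitHub-style markdown table has exactly one. This is a sketch under that assumption; the function names are hypothetical, and the audit's actual counting method is not described in the text.

```python
DELIMITER_CHARS = set("|-: ")


def count_markdown_tables(message: str) -> int:
    """Count tables by their delimiter rows, e.g. `| --- | :---: |`."""
    tables = 0
    for line in message.splitlines():
        stripped = line.strip()
        is_delimiter = (
            "|" in stripped
            and "-" in stripped
            and set(stripped) <= DELIMITER_CHARS
        )
        tables += is_delimiter
    return tables


def mean_tables(messages: list[str]) -> float:
    """Average number of tables per message (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(count_markdown_tables(m) for m in messages) / len(messages)
```

Comparing mean_tables over passing messages with mean_tables over score-0 messages gives the kind of side-by-side figure reported above. A horizontal rule such as a bare "---" line is ignored because it contains no pipe character.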

Axis 2: how each model spends its compute budget

Agent compute time per task (median and mean, excluding evaluation / provisioning), across the benchmark:

Model	Median	Mean	n
Opus 4.7	5.6 min	22 min	~140
Composer 2.5	11.8 min	42 min	~145
GPT-5.5	13.5 min	30 min	~145
Opus 4.8	40 min	67 min	~145

Opus 4.7 — the predecessor — is the fastest; Opus 4.8 the slowest. The spread is large, and it isn't explained by per-call latency.

The proximate cause

We checked every plausible explanation and found one that explains almost the entire spread.

Under the claude-code harness, neither Opus version ever emits parallel tool_use blocks: 0 out of 10,000+ assistant messages in our sample contain more than one tool call per inference. The Anthropic API supports parallel tool_use; the models just never produce it here. Composer and Codex also don't emit parallel tool_use blocks at the API level, but Codex's protocol packs many discrete commands per turn, and Composer averages 2.77 sub-commands per shell call (it composes at the shell level).
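The check behind the "0 out of 10,000+" count is simple to state. The sketch below assumes each assistant message has been parsed into a dict whose content field is a list of blocks carrying a type key, which is the shape of Anthropic Messages API responses. How claude-code transcripts are stored on disk, and the function name, are assumptions.

```python
def count_parallel_tool_messages(messages: list[dict]) -> tuple[int, int]:
    """Return (assistant messages with >1 tool_use block, assistant messages)."""
    parallel = total = 0
    for message in messages:
        if message.get("role") != "assistant":
            continue
        total += 1
        content = message.get("content")
        # Content may also be a plain string; that carries no tool calls.
        blocks = content if isinstance(content, list) else []
        tool_calls = sum(
            1 for b in blocks
            if isinstance(b, dict) and b.get("type") == "tool_use"
        )
        if tool_calls > 1:
            parallel += 1
    return parallel, total
```

A result whose first element is 0 over a large second element corresponds to the finding above: many inferences, none of which carries more than one tool call.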

The behavior that separates the models is shell-level composition:

Model	% of Bash calls that compose multiple commands (`&&`, `;`, newlines)
Opus 4.7	28.2%
Opus 4.8	0.0%
GPT-5.5 (codex)	67.5%
Composer 2.5	58.5%

The models sit on a spectrum from 0% (Opus 4.8) to 67.5% (GPT-5.5). A model at the high end writes cd output && python solve.py && head result.csv: one Bash call, one inference, three pieces of work done. A model at the low end writes cd output, waits, then python solve.py, waits, then head result.csv: three calls, three inferences. The work hasn't changed; the number of LLM round-trips has. Opus 4.7 composes a bit over a quarter of its shell calls and Opus 4.8 essentially none, which accounts for almost all of the roughly 7× median-time difference between them (40 min vs. 5.6 min in the table above).
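A composition rate like the one in the table can be approximated with a small scanner that splits each shell call on `&&`, `;`, and newlines while ignoring separators inside quotes. This is a heuristic sketch with hypothetical function names, not the audit's measurement code: it does not handle backslash escapes, heredocs, or subshells, and it treats a pipeline as a single command.

```python
def count_subcommands(command: str) -> int:
    """Count commands joined by `&&`, `;`, or newlines outside quotes."""
    parts = 0
    has_text = False  # current segment contains non-whitespace text
    quote = None      # quote character we are inside, if any
    i = 0
    while i < len(command):
        ch = command[i]
        if quote is not None:
            if ch == quote:
                quote = None
            has_text = True
        elif ch in "'\"":
            quote = ch
            has_text = True
        elif command.startswith("&&", i) or ch in ";\n":
            parts += has_text
            has_text = False
            if ch == "&":
                i += 1  # skip the second `&`
        elif not ch.isspace():
            has_text = True
        i += 1
    return parts + has_text


def composition_rate(commands: list[str]) -> float:
    """Share of shell calls that contain more than one sub-command."""
    if not commands:
        return 0.0
    composed = sum(1 for c in commands if count_subcommands(c) > 1)
    return composed / len(commands)
```

For the example in the text, count_subcommands("cd output && python solve.py && head result.csv") returns 3, while a call such as python -c "import os; print(os.getcwd())" counts as 1 because its semicolon sits inside quotes.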

A second-order consequence

Once you line up the two axes, they look related. A model that issues more atomic tool calls has many more chances per task to write narration between them — and narration is where a model self-attests. Each "OK, that worked — now let me check…" between commands is a tiny self-affirmation. By the time a low-composition model writes its final message, its own context window may contain dozens of little wins, giving it every reason to believe the task succeeded; the production scorer never weighs in. A model that composes 5 commands into one call has 5× fewer chances to self-affirm, and tends to write its final summary against the actual end-state rather than a sequence of intermediate "checks pass".

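We can't prove this causal chain without controlled experiments, and with only four models the correlation is suggestive rather than clean. The extremes fit: Opus 4.8 never composes and is both the slowest model and the one with the highest false-completion rate. The middle does not: Composer 2.5 composes 58.5% of its shell calls yet has a 58.5% false-completion rate, and Opus 4.7 composes less often than GPT-5.5 or Composer 2.5 while being both faster and less prone to false completion.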
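A rough way to sanity-check how closely composition tracks the other two measures is a Spearman rank correlation over the four rows of the tables above. The sketch below uses the no-ties formula and hypothetical function names; with n = 4 the statistic is fragile and should be read as a direction at most.

```python
def ranks(values: list[float]) -> list[int]:
    """1-based ascending ranks; assumes no ties (true for the tables above)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Rows: Opus 4.7, Opus 4.8, GPT-5.5, Composer 2.5 (values from the tables).
composition_pct = [28.2, 0.0, 67.5, 58.5]
median_minutes = [5.6, 40.0, 13.5, 11.8]
false_completion_pct = [12.9, 61.9, 19.0, 58.5]

rho_time = spearman(composition_pct, median_minutes)
rho_false = spearman(composition_pct, false_completion_pct)
```

On these four rows rho_time comes out at about -0.2 and rho_false at about -0.4. The sign matches the hypothesis (more composition, less time and fewer false completions), but neither value is strong evidence on four data points.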

Other findings

Evaluator and harness fairness: at most ~5-6% of the gap

Before publishing this we wanted to know whether the benchmark was unfair to any model. We swept nearly all of the Opus 4.8 transcripts with a file-not-found detector (with strict false-positive filtering), pulled every suspicious output, and re-ran the scoring scripts locally on the actual artifacts.

Two confirmed false-zeros:

rgi_mcr1_colistin_v2: a known staging race where the harness destroys data it just staged. Opus 4.8's answer scored 0.0 in production and 1.0 in our local re-run.
engineering/abb_irb6700_asset_to_urdf_instance_1: a strict whitelist hard-gate (entries == ["submission.urdf"]) plus an ambiguous instruction line tripped Opus 4.8 into adding a meshes/ subdirectory. The URDF content scored 0.682 in our local re-run; production gave 0.0.
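The second false-zero is easy to reproduce in miniature. The sketch below assumes the gate compares a sorted top-level directory listing against the whitelist quoted above; the function names and directory layout are hypothetical, and the real scorer may build entries differently.

```python
import os
import tempfile

ALLOWED_ENTRIES = ["submission.urdf"]


def passes_whitelist_gate(output_dir: str) -> bool:
    """Hard gate: the output directory may contain only the allowed entries."""
    entries = sorted(os.listdir(output_dir))
    return entries == ALLOWED_ENTRIES


def demo() -> tuple[bool, bool]:
    """Gate result before and after an extra `meshes/` directory is added."""
    with tempfile.TemporaryDirectory() as output_dir:
        urdf_path = os.path.join(output_dir, "submission.urdf")
        with open(urdf_path, "w", encoding="utf-8") as handle:
            handle.write("<robot name='example'/>\n")
        before = passes_whitelist_gate(output_dir)
        os.mkdir(os.path.join(output_dir, "meshes"))
        after = passes_whitelist_gate(output_dir)
    return before, after
```

Under this assumption the gate passes with submission.urdf alone and fails as soon as meshes/ appears next to it. If the gate runs before content scoring, a URDF that would otherwise earn partial credit is zeroed by one stray directory.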

Beyond these two, no more than a handful of tasks have plausible evaluator-side issues. The benchmark is broadly fair to all four models; the behavioral differences above are not artifacts of bad scoring.

A benchmark-hygiene fix worth making

For the abb_irb6700 instruction, a one-line fix would prevent the false-zero pattern:

- - Reference meshes under the staged `meshes/` directory.
+ - URDF `<mesh filename="...">` paths must use the literal form `meshes/<filename>.stl`.

For the rgi_mcr1_colistin_v2 staging race, the fix is in the harness's setup/cleanup logic, not in the agent. Both remain open benchmark-hygiene items.

What this means for model selection

No single model wins on every axis, and the leaderboard's near-tie hides real trade-offs:

Opus 4.7 is the fastest model here and among the most honest about failure — though it's a previous-generation release.
Opus 4.8 is cautious and thorough per step. That costs wall-clock time on long-horizon tasks and produces confident final messages you can't take at face value — but the same per-step deliberation is an asset elsewhere: on short tasks, separate measurements (GDPval-AA) find it more efficient than 4.7, using 15% fewer turns. The cost we observe shows up specifically on the long-horizon work in this benchmark, where its output compression doesn't engage and per-step caution dominates.
GPT-5.5 is fast and, alongside Opus 4.7, among the most likely to tell you when it failed, at the cost of terse summaries that explain less about partial progress.
Composer 2.5 is fast and scores comparably to Opus 4.8 (~18% pass rate), but its final reports always look authoritative regardless of outcome, so you have to verify its output independently.

If you're choosing a model for production agentic work, the useful question isn't "which scores highest" — it's "which failure mode can your system tolerate, and how short are your tasks."

What we'd do next

Run the same audit on the next models from each lab so we can see whether these behavioral profiles are stable across releases, and whether GPT-5.6 changes Codex's honest-failure behavior.
Reproduce the GDPval-AA findings on the same setup to confirm the short-task / long-task split we see in Opus 4.8 isn't an artifact of our task mix.
Test whether false completion drops when shell composition is forcibly enabled. Our hypothesis is that exposing a BatchTool (or equivalent) to claude-code would reduce both wall-clock time and false-completion rate at once, because the model's context wouldn't accumulate dozens of self-attestations per task.

Methodology

All raw scores come from each model's production run records.
Final messages were classified by a uniform bilingual keyword classifier (English + Chinese). Manual validation on a sample of Composer's false-completion bucket gave 93% precision.
Time stats use agent compute only, not wall-clock. We count time spent in the agent loop and exclude the post-agent evaluation, provisioning, and cleanup phases. An earlier draft used a wall-clock figure that bundled in evaluation (1.5–2.4× larger); excluding it changed the picture, so we report agent compute only.
Inference counts use raw assistant messages (claude-code) versus reasoning events (codex); Composer matches claude-code's protocol.

A note on tone

We like all four of these models, and none of this work makes any of them less useful. The point isn't to crown a winner or single one out — it's that the public leaderboard number doesn't capture how a model behaves when you put it to work: whether you can trust what it tells you, and how it spends its budget getting there. Those differences matter if you're picking a model for production agentic work, and on those questions the four models differ far more than their leaderboard scores suggest.