Blog

Notes from the Benchmark

Model audits, benchmark design, and why we measure what we measure.

Analysis · June 11, 2026 · 11 min read

Does the Harness Matter? Lessons from ALE-Claw on Agents’ Last Exam

Swapping the model under a fixed harness moves ALE pass rate by 18 points; swapping the harness moves it by 5–6 points. ALE-Claw, a deliberately minimal harness, matches heavier ones at a fraction of the cost. Interactive: pick any agents to compare cost, time, and tool-call mix.

Analysis · June 9, 2026 · 10 min read

Codex (GPT-5.5) vs. Claude Code (Fable 5) vs. Cursor (Composer 2.5): An Interactive Comparison

A live, interactive comparison of agentic coding harnesses on the agenthle benchmark. Toggle any model to compare performance, cost and time, per-domain strengths, and a unified failure-mode taxonomy built by analyzing run trajectories.

Announcement · June 14, 2026 · 5–7 min read

Agents’ Last Exam

We evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier systems on 1,500+ expert-sourced tasks spanning 55 occupations. Today’s agents solve a meaningful fraction of professional work, but on ALE’s hardest tier every frontier agent we tested, including Fable 5, scored 0%.

Vision · March 4, 2026 · 3 min read

The Benchmarks Keep Falling. The Economy Hasn’t Noticed.

AI progress is shaped by what we choose to measure. We are building the instrument that measures what matters: real economic value, not proxy metrics.