Blog
Notes from the Benchmark
Model audits, benchmark design, and why we measure what we measure.
Audit · June 9, 2026 · 12 min read
Opus 4.8 vs. GPT-5.5 vs. Composer 2.5: A Behavioral Audit Beyond the Leaderboard
A head-to-head behavioral audit of four agentic models on the same ~150 tasks: how honestly each reports failure, how each spends its compute, and why the leaderboard's near-tie hides real trade-offs.
Read more →
Vision · March 4, 2026 · 3 min read
The Benchmarks Keep Falling. The Economy Hasn’t Noticed.
AI progress is shaped by what we choose to measure. We are building the instrument that measures what matters — real economic value, not proxy metrics.
Read more →