Analysis · June 11, 2026 · 11 min read

Does the Harness Matter? Lessons from ALE-Claw on Agents’ Last Exam

A strong agent needs a harness. The question is how much harness is enough.

By Yixiao Huang and Yiyou Sun

TL;DR

  • On Agents' Last Exam (ALE), model choice moves the score more than harness choice: under a fixed OpenClaw harness, the model sweep spans 18.0 percentage points, while fixed-model harness sweeps span only 5 to 6 points.
  • We built ALE-Claw, a deliberately small computer-use harness derived from OpenClaw. It removes product-assistant machinery while keeping the core agent loop, and reaches the same accuracy band with 44% fewer input tokens, 41% lower cost, and 60% less wall-clock time than OpenClaw.
  • A richer harness is not automatically a better one. Across the five GPT-5.5 harnesses, neither a bigger tool surface nor a heavier product layer came with a higher score.

1. The Debate

Since ReAct, agent builders have spent years making harnesses more elaborate, but the basic loop is unchanged: build context, call the model, dispatch a tool, observe the result, compact or prune context, and repeat until the agent submits a final answer.
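The loop described above can be sketched in a few lines. This is a minimal illustration, not the OpenClaw or ALE-Claw implementation: `Action`, `AgentLoop`, the `"submit"` convention, and the `max_context_items` pruning budget are all hypothetical names chosen for the example, and the model is a scripted stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    tool: str       # tool name, or "submit" to end the run (assumed convention)
    argument: str


@dataclass
class AgentLoop:
    model: Callable[[list], Action]            # hypothetical model interface
    tools: dict                                # tool name -> callable(str) -> str
    max_context_items: int = 50                # illustrative pruning budget
    history: list = field(default_factory=list)

    def run(self, task: str, max_steps: int = 100) -> Optional[str]:
        self.history = [f"task: {task}"]                   # build context
        for _ in range(max_steps):
            action = self.model(self.history)              # call the model
            if action.tool == "submit":
                return action.argument                     # final answer
            handler = self.tools.get(action.tool)
            if handler is None:
                observation = f"error: unknown tool {action.tool!r}"
            else:
                observation = handler(action.argument)     # dispatch the tool
            # observe the result by appending it to the context
            self.history.append(f"{action.tool}({action.argument}) -> {observation}")
            # prune: keep the task line plus the most recent items
            if len(self.history) > self.max_context_items:
                keep = self.max_context_items - 1
                tail = self.history[len(self.history) - keep:]
                self.history = [self.history[0]] + tail
        return None                                        # step budget exhausted


def scripted_model(history):
    # Stand-in for a real model: call one tool, then submit what it observed.
    if len(history) == 1:
        return Action("echo", "hello")
    return Action("submit", history[-1])


loop = AgentLoop(model=scripted_model, tools={"echo": lambda s: s.upper()})
result = loop.run("demo")
print(result)
```

Everything a production harness adds, such as memory, skills, or sub-agents, wraps around a loop of roughly this shape.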

Around that loop, production systems add a lot: memory, skills, planning tools, sub-agents, and user preferences. Some of those features clearly matter in interactive products. If an assistant is supposed to work with a human over weeks, remember preferences, ask clarifying questions, and recover from ambiguous instructions, a thin benchmark loop is not enough.

The recent Terminal-Bench writeups make this point forcefully. KRAFTON's Terminus-KIRA post argues that a very minimal terminal harness left frontier models with avoidable failure modes. ForgeCode's “Benchmarks Don't Matter — Until They Do” post tells a similar story from the other direction: the same model moved from weak benchmark performance to state of the art after the runtime was made non-interactive, faster, and stricter about tools and planning.

So the natural takeaway is: harnesses matter a lot.

There is also a cautionary version of the same story. A recent DebugML audit found cheating or reward hacking across 28+ submissions and 9 benchmarks, including Terminal-Bench scaffolds that exposed verifier files or injected non-official AGENTS.md answer keys into the agent context. In one audited case, replacing tainted ForgeCode traces with clean-scaffold runs on the same model dropped pass rate from 81.8% to 71.7%. That raises the question: did the scaffold genuinely make the agent better, or did it just cheat or overfit to the benchmark?

ALE gives us a chance to test how far that claim travels in a different setting. Terminal-Bench is centered on terminal tasks, where a task- or domain-specific harness can be a large advantage. ALE instead asks agents to handle long-running professional work that can last for hours and span many industries, which pushes the harness toward a more general computer-use interface.

In particular, we are asking:

If the model is strong, how much does the harness still move the score?

2. ALE and the Shared GCUA Harness

Agents' Last Exam evaluates Generalist Computer-Use Agents (GCUAs): agents that can operate across shell, files, GUI applications, and web research rather than only inside one terminal workflow.

ALE is designed to measure sustained performance on long-horizon, economically valuable, real-world work with verifiable outcomes. Developed with industry experts and grounded in the O*NET-SOC taxonomy, the public benchmark spans around 150 tasks across 55 subfields and 13 industry clusters. Many runs can last for hours, so ALE stresses whether a harness can support broad professional computer work without being tuned to one narrow task family.

We evaluate ALE across multiple frontier models and agent harnesses, and the results show that the benchmark is far from saturated. The strongest configuration reported, Codex with GPT-5.5, is below 50% full-pass on the easiest tier and below 10% on the hardest. The average full-pass rate on the hardest tier is 2.6%.

The shared GCUA harness has to be general enough to support that range. In ALE, the common starting point is this structure:

Shared GCUA harness architecture: a main loop around a system-prompt builder, tool system, and context manager. Reproduced from Figure 5 in our paper.
  • Main loop. Calls the model, dispatches actions, observes results, and repeats.
  • Prompt builder. Assembles task instructions, runtime metadata, tool guidance, and behavioral rules.
  • Tool system. Exposes shell, files, web, GUI actions, and sometimes background processes or sub-agent delegation.
  • Context manager. Keeps long trajectories inside the model context window.

This shared core is the starting point. The question is whether the product layer built around it, including memory, skills, preferences, and sub-agents, moves performance on ALE more than the model itself.

3. The Harness Moves Less Than the Model

ALE includes two useful sweeps:

  • Hold the harness fixed, using OpenClaw, and vary the model.
  • Hold the model fixed and vary the harness.
Model choice vs. harness choice on ALE. Reproduced from Figure 9 in our paper.

The difference is stark.

| Sweep | Range (full-pass) | Spread |
|---|---|---|
| Model sweep, fixed OpenClaw harness | 5.3% to 23.3% | 18.0 pp |
| Harness sweep, fixed GPT-5.5 | 19.3% to 25.3% | 6.0 pp |
| Harness sweep, fixed Claude Opus 4.7 | 14.7% to 20.0% | 5.3 pp |

Model choice accounts for roughly three times the pass-rate spread of harness choice: 18.0 pp versus 5.3 to 6.0 pp.
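The spreads and their ratio can be recomputed from the ranges reported above. The snippet uses only the endpoints given in this section; the dictionary keys are labels made up for the example.

```python
# Full-pass ranges (low %, high %) as reported for each sweep.
sweeps = {
    "model sweep (fixed OpenClaw)": (5.3, 23.3),
    "harness sweep (fixed GPT-5.5)": (19.3, 25.3),
    "harness sweep (fixed Claude Opus 4.7)": (14.7, 20.0),
}

# Spread in percentage points = high endpoint minus low endpoint.
spreads = {name: round(high - low, 1) for name, (low, high) in sweeps.items()}

# How many times larger the model spread is than each harness spread.
model_spread = spreads["model sweep (fixed OpenClaw)"]
ratios = {
    name: round(model_spread / spread, 1)
    for name, spread in spreads.items()
    if name.startswith("harness")
}

for name, spread in spreads.items():
    print(f"{name}: {spread} pp")
for name, ratio in ratios.items():
    print(f"model spread / {name}: {ratio}x")
```

The model spread comes out 3.0 times the GPT-5.5 harness spread and 3.4 times the Claude Opus 4.7 harness spread.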

4. ALE-Claw: The Minimal-Harness Test

That asymmetry gives us a simple ablation question: if harness choice moves the score less than model choice, can a small harness reach the frontier?

ALE-Claw is the subtraction test. It starts from the OpenClaw agent loop, then removes the machinery for a long-lived personal assistant: scheduled prompts, chat gateways, skills, plugin lifecycle hooks, user preferences, and long-term assistant memory. In our implementation, this simplification reduces the system prompt by roughly 65%. Section C.4 of the paper appendix and the GitHub code have more implementation detail.

What remains is the minimal harness:

  • Main loop. A single action loop for model calls, tool dispatch, observations, and final submission.
  • Prompt builder. A compact prompt focused on the current task, runtime, active tools, and task-local memory.
  • Tool system. File, shell, web, and GUI actions, with optional sub-agent delegation when a bounded side task is useful.
  • Context manager. Task-local memory and compaction for hours-long trajectories.

This gives us a clean test: after removing the product layer, how much performance and efficiency does ALE-Claw keep?

5. Results: What the Minimal Harness Shows

To test the minimal-harness hypothesis, we evaluated GPT-5.5 on ALE across five harnesses: ALE-Claw, OpenClaw, Codex, Cursor, and Droid. The two charts below are interactive and loaded live from our results database. The five harnesses are selected by default, and you can add or remove any agent we ran with the + Add model button.

5.1 Same Accuracy Band, Much Lower Cost

Holding the model fixed at GPT-5.5, the five harnesses land in roughly the same accuracy neighborhood. What changes much more is how much each harness spends to get there. Up and to the left is better: high score, low cost. Bubble size is total tokens; hover any bubble for its exact cost, time, and tokens, or use the toggle to switch the y-axis between pass rate, mean score, and both.


The cleanest comparison is ALE-Claw vs OpenClaw: same code family, same model. Removing the product layer did not hurt mean score: ALE-Claw is slightly higher (0.485 vs 0.464) while using 44% fewer input tokens, 41% lower cost, and 60% less wall-clock time.

This is the central result: ALE-Claw reaches the same accuracy band while spending much less. The rest of the analysis asks why: do the harnesses actually solve different tasks, do their tool choices matter, and which product features actually earn their cost?

5.2 Same Model, Mostly Same Outcomes

Aggregate scores alone do not tell us whether two harnesses solve the same tasks. The sharper test is whether OpenClaw's product layer changes which tasks get solved. If it did, ALE-Claw and OpenClaw would disagree often when they run the same model. In practice, their outcomes are usually aligned.

With GPT-5.5:

  • Both fully solve about 19% of cases.
  • Both clearly fail, with score below 0.5, on about 44%.
  • Together, about 63% reach the same clear verdict.
  • Only about 5% flip full-pass status, and most of those flips favor ALE-Claw, the smaller harness.

Outside those buckets the two do earn different partial scores, but the pass/fail story is stable: on the same model, they largely agree on what is solvable.
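One way to compute agreement buckets like these from paired per-task scores is sketched below. It assumes a score of 1.0 means full pass (an assumption; the post only states the 0.5 cutoff for a clear fail), and the task IDs and scores are invented purely to exercise the code. They are not ALE results.

```python
from collections import Counter

FULL_PASS = 1.0    # assumption: a score of 1.0 counts as full pass
CLEAR_FAIL = 0.5   # "clearly fail" cutoff from the text: score below 0.5


def bucket(score_a, score_b):
    """Classify one task by how two harnesses scored it."""
    pass_a = score_a >= FULL_PASS
    pass_b = score_b >= FULL_PASS
    if pass_a and pass_b:
        return "both_pass"
    if pass_a != pass_b:
        return "flip"          # exactly one harness earned full credit
    if score_a < CLEAR_FAIL and score_b < CLEAR_FAIL:
        return "both_fail"
    return "partial"           # neither passed, at least one scored 0.5 or more


def agreement(paired):
    """Return the share of tasks falling in each bucket."""
    counts = Counter(bucket(a, b) for a, b in paired.values())
    total = len(paired)
    names = ("both_pass", "both_fail", "flip", "partial")
    return {name: counts[name] / total for name in names}


# Invented (harness A, harness B) scores, for illustration only.
paired_scores = {
    "task-1": (1.0, 1.0),
    "task-2": (0.2, 0.1),
    "task-3": (1.0, 0.7),
    "task-4": (0.6, 0.4),
    "task-5": (0.0, 0.3),
}
shares = agreement(paired_scores)
print(shares)
```

The "same clear verdict" figure is the sum of the `both_pass` and `both_fail` shares.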

That cuts against the idea that product-layer assistant features are what drive ALE performance. Memory, preference handling, skill routing, and heavy orchestration may sharpen a product experience, but here they rarely change whether the deliverable earns full credit.

5.3 Tool Surfaces Differ, Outcomes Do Not

The harnesses reach similar scores while using very different tool strategies.

Codex does almost everything through shell. Cursor uses dedicated file tools heavily and barely touches the GUI. OpenClaw routes a fifth of calls through a background-process category. Droid uses planning tools much more than the others and delegates routinely. The surface size also differs: in our tool-surface comparison, ALE-Claw exposes 13 visible tools, while Cursor exposes 30.

Yet all five sit inside a 6-point full-pass band.

The core reason is tool redundancy. Many harness tools are different interfaces to the same underlying operation: read a file, search a tree, edit text, run code, inspect output. A strong model can read a file through cat, a file-read tool, or an editor; it can patch through shell or a structured edit tool; it can sometimes avoid GUI work by using files or scripts. Once the primitive operations are covered, extra tools often change the route rather than the destination.

More tools are not automatically better. They increase the action space and the amount of tool documentation the model has to carry in context. That is why tool-use systems such as Gorilla and ToolLLM use retrieval to narrow large API catalogs instead of putting every possible tool in front of the model. Microsoft Research calls the failure mode tool-space interference: when overlapping tools are co-present, agents can spend more tokens, take longer action paths, recover more brittly from errors, or fail outright.
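The idea of narrowing a tool catalog before it reaches the model can be illustrated with a toy selector. This is not how Gorilla or ToolLLM retrieve tools: plain keyword overlap stands in for whatever retriever a real system would use, and the catalog entries, stopword list, and `select_tools` function are hypothetical.

```python
import re

# Small illustrative stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "for", "of", "and", "on", "its", "from", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def select_tools(task, catalog, k=3):
    """Return up to k tool names whose descriptions overlap the task text."""
    task_words = tokenize(task)
    scored = []
    for name, description in catalog.items():
        overlap = len(task_words & tokenize(name + " " + description))
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


# Hypothetical tool catalog.
catalog = {
    "read_file": "read the contents of a file from disk",
    "run_shell": "run a shell command and return its output",
    "click": "click a GUI element on the screen",
    "web_search": "search the web for pages matching a query",
}

selected = select_tools("search the web for the quarterly report", catalog, k=2)
print(selected)
```

Only the selected tools' documentation would then be placed in context, which shrinks both the action space and the prompt.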

Sub-agents show the same pattern. With GPT-5.5, sub-agent calls appear in about 35% of Droid runs, about 6% of ALE-Claw runs, under 1% of Cursor runs, and 0% of OpenClaw or Codex runs. Droid delegates the most and scores the lowest, so delegation is not what drives the score band on these runs.

5.4 More Layers Cost More

Products like Claude Code, Codex, Cursor, and OpenClaw are built as assistants. They remember user preferences, adapt to interaction style, ask clarifying questions, maintain planning state, compact long conversations, and carry useful context across sessions.

The first problem is cost. Every added layer has to be prompted, logged, serialized, searched, sometimes summarized, and then interpreted by the model. Even when the layer is idle, its instructions and state can still consume context, add decision points, and slow the run.

The ALE-Claw vs OpenClaw comparison makes that tax visible. OpenClaw spends 71% more dollars per task and 147% more wall-clock time on the matched GPT-5.5 slice, without a better mean score.
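These "more" figures and the earlier "lower" figures describe the same gap from opposite baselines. The conversion is simple arithmetic, sketched here with hypothetical helper names; the inputs are the rounded percentages quoted in the post, so the results agree only to within rounding.

```python
def less_from_more(more_fraction):
    """If B uses `more_fraction` more than A, how much less does A use than B?"""
    return 1.0 - 1.0 / (1.0 + more_fraction)


def more_from_less(less_fraction):
    """If A uses `less_fraction` less than B, how much more does B use than A?"""
    return 1.0 / (1.0 - less_fraction) - 1.0


# OpenClaw relative to ALE-Claw: 71% more dollars, 147% more wall-clock time.
cost_saving = less_from_more(0.71)   # about 0.415
time_saving = less_from_more(1.47)   # about 0.595

print(f"ALE-Claw cost saving: {cost_saving:.1%}")
print(f"ALE-Claw time saving: {time_saving:.1%}")
```

The results, about 41.5% and 59.5%, line up with the 41% lower cost and 60% less wall-clock time quoted earlier, allowing for rounding in the published figures.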

The deeper problem is fit. An ALE run has no returning user to personalize for, no human to answer clarifying questions, and no second attempt that should inherit a preference from the first. The agent either produces the required artifact in this run or it does not. Product memory optimizes for an ongoing assistant relationship; ALE mostly asks for isolated task execution.

Context compaction shows why more layers do not automatically help. ALE-Claw and OpenClaw both watch trajectories against a 1M-token window and can summarize older turns when a run gets too large. The intended path is sensible: flush important facts to task memory, then compact older context into a smaller summary before continuing. That is useful insurance when work stretches across many sessions or related tasks. But in a single ALE run, the condition where it helps almost never appears: in the GPT-5.5 ALE-Claw logs, the threshold was never reached, and OpenClaw logged a compaction checkpoint in only about 2% of runs.
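The flush-then-compact path can be sketched as follows. This is an illustration under stated assumptions, not the ALE-Claw or OpenClaw code: the 1M-token window comes from the text, but the 0.8 trigger fraction, the `FACT:` prefix convention, the word-count tokenizer, and the function names are all invented for the example.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def maybe_compact(turns, memory, window=1_000_000, trigger=0.8,
                  keep_recent=4, summarize=None):
    """Compact older turns once the trajectory nears the context window.

    Returns (turns, compacted). `memory` is a list mutated in place.
    """
    total = sum(count_tokens(turn) for turn in turns)
    if total < trigger * window or len(turns) <= keep_recent:
        return turns, False                      # threshold not reached

    cut = len(turns) - keep_recent
    old, recent = turns[:cut], turns[cut:]

    # Step 1: flush important facts to task memory (assumed "FACT:" marker).
    memory.extend(turn for turn in old if turn.startswith("FACT:"))

    # Step 2: replace older context with a smaller summary.
    if summarize is not None:
        summary = summarize(old)
    else:
        summary = f"[summary of {len(old)} earlier turns]"
    return [summary] + recent, True


turns = [
    "FACT: budget is 40k",
    "open spreadsheet",
    "read rows",
    "compute totals",
    "write report",
    "check output",
]

# With the default 1M-token window, nothing happens: the run is far too short.
memory = []
unchanged, compacted = maybe_compact(turns, memory)

# A tiny window forces the compaction path so it can be observed.
small_memory = []
shrunk, did_compact = maybe_compact(turns, small_memory, window=10, keep_recent=2)
print(shrunk, small_memory)
```

The first call mirrors what the logs show for single ALE runs: the check executes on every step, but the branch that does the work is almost never taken.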

6. What's Next

For ALE-style evaluation with a capable model, a benchmark harness should be lean: expose the environment, keep context under control, give the model enough primitive tools to act, and get out of the way. ALE-Claw reaches the same accuracy neighborhood as heavier harnesses on far less time and money, which says the product layer was not the limiting factor here.

A lean harness also ages better. As Han-chung Lee argues, much of today's scaffolding exists to compensate for current model limitations, so a well-engineered 2026 harness is largely a 2026 artifact: the layers that look necessary now tend to dissolve into the next generation of models. The implication is to keep the harness thin and spend the hard engineering on the model and the evaluation, where the work compounds, rather than on scaffolding the next model may absorb.

Because harness changes move pass rate only modestly, pass rate is not where harness work has the most leverage. The bigger gains are in efficiency and measurement: the same answer for fewer tokens, dollars, and minutes, and a clearer picture of which pieces of the harness actually carry weight.

Three directions look more valuable than adding more product features:

  • Test weaker models. Measure whether a richer harness helps weaker models more than it helps GPT-5.5. The model effect was measured under a fixed harness; the next test is whether harness engineering narrows the gap for weaker backbones.
  • Evaluate cross-run behavior. This is where layers like compaction and cross-session memory would finally be exercised. They sit idle on single-run ALE, as the compaction logs above show, because they are built for agents that revisit related tasks, retry, or carry context across runs. A dedicated cross-run evaluation is the right place to see whether they pay off.
  • Run real component ablations. ALE-Claw is a product-layer comparison, not a full component ablation. Remove compaction, GUI access, modular file tools, and sub-agent delegation one at a time to learn what actually changes outcomes.
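The one-at-a-time ablation in the last item could be set up as a small configuration grid. The component names below are taken from that item; the flag names and the `ablation_configs` helper are hypothetical and do not correspond to real ALE-Claw options.

```python
# Components named in the text, written as hypothetical feature flags.
COMPONENTS = (
    "compaction",
    "gui_access",
    "modular_file_tools",
    "subagent_delegation",
)


def ablation_configs(components=COMPONENTS):
    """Build a full config plus one config per removed component."""
    full = {name: True for name in components}
    configs = {"full": full}
    for removed in components:
        # Copy the full config and switch off exactly one component.
        configs[f"no_{removed}"] = {**full, removed: False}
    return configs


configs = ablation_configs()
for label, flags in configs.items():
    enabled = [name for name, on in flags.items() if on]
    print(f"{label}: {', '.join(enabled)}")
```

Running the same model and task set under each configuration, then comparing each `no_*` run against `full`, isolates the contribution of a single component.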

Read the benchmark and the ALE-Claw harness.