Recursive Agent Harness Achieves 89% Accuracy on Long-Context Code Tasks

Researchers have named and benchmarked the Recursive Agent Harness (RAH) pattern, which involves agents spawning complete subagent instances with filesystem access, code execution, and planning tools. This pattern has been shown to increase long-context coding accuracy from 71.75% to 81.36% over a Codex baseline when both use a GPT-5 backbone. Anthropic is already running a production variant under the Dynamic Workflows research preview. The RAH differs from prior recursive language models by treating the recursive unit as a full harness rather than a bare model call: the parent agent writes an executable script that spins up parallel subagents, each with fresh context and tool access, then integrates their outputs through structured function calls. Anthropic's implementation further enhances this by having the parent generate a JavaScript orchestration script that a separate runtime executes in the background, keeping the parent session responsive while intermediate results live in script variables instead of the parent's context window.

FIG. 02 Accuracy progression: Codex baseline (71.75%) to RAH with GPT-5 (81.36%) to Claude Sonnet 4.5 (89.77%) on 199 long-context samples. — ai|expert analysis, source: arxiv.org/abs/2606.13643v1

The paper controls for model capability by holding the backbone fixed at GPT-5 to match published Codex and RLM baselines on Oolong-Synthetic, a 199-sample benchmark with 13 context-length buckets scaling up to 4 million tokens. Swapping in Claude Sonnet 4.5 with the same RAH design pushes accuracy to 89.77%, suggesting the architecture itself—not just frontier model scale—drives the gain. In production, Anthropic caps Dynamic Workflows at 16 concurrent subagents and 1,000 total subagents per run, with each subagent carrying its own linear context cost rather than inflating the parent window. A documented case study shows the pattern can deliver at serious scale: Jarred Sumner used Anthropic's tool to port the Bun runtime from Zig to Rust, producing roughly 750,000 lines of new code and merging in 11 days, while adversarial verification passes task a second wave of agents to refute the first wave's outputs and catch hallucinated bug reports.

Despite the throughput wins, token economics are punishing. Anthropic explicitly warns that dynamic workflows consume "meaningfully more tokens" than conversational problem-solving, and the Trilogy AI analysis notes that a run hitting an unexpected state can spend five times as many tokens recovering as it would to fail cleanly. The RLM baseline paper had already shown that recursive model calls outperform context compaction by 26% and CodeAct with sub-calls by 130% on long-context tasks, while a fine-tuned RLM-Qwen3-8B outperforms the base Qwen3-8B model by 28.3%; but RAH adds harness-level orchestration overhead on top of those gains. For architects metering inference budgets, the cost model shifts from a single context window to a distributed graph of linear contexts where every branch pays full cold-start overhead and recovery is paid in tokens, not seconds.

The operational hazards are equally concrete. Anthropic's engineering blog documents early failure modes including agents spawning 50 subagents for trivial queries, endless web-search loops, and excessive inter-agent chatter. Dynamic Workflows forbid user input mid-execution and lock spawned agents into acceptEdits mode, removing human circuit-breakers during long runs. Anthropic recommends discrete checkpoints for state changes rather than validating every intermediate step, and advises scoping tasks tightly before launching repo-wide audits because context overhead compounds fast with no mid-stream steering. The broader architectural consensus, echoed by both Anthropic's guidance and independent analysis, is that this pattern fits genuinely unpredictable, one-off exploratory work—debugging unfamiliar codebases, open-ended research, one-time migrations—and remains the wrong tool for repetitive production tasks where token spend and latency variance are unacceptable.

What an architect should steal is the checkpointed, file-externalized orchestration script with bounded fanout, but only if every subagent invocation is metered and capped like a database connection pool.

Sources

RAH improves Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens) with GPT-5 backbone; Claude Sonnet 4.5 reaches 89.77%
"RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%."
arxiv.org ↗
RAH treats the recursive unit as a full agent harness with filesystem tools, code execution, and planning — called harness recursion, the code-first extension to model recursion
"We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks."
arxiv.org ↗
Anthropic Dynamic Workflows: Claude writes a JavaScript orchestration script; a separate runtime executes it in the background; intermediate results live in script variables, not Claude's context window; 16 concurrent / 1,000 total subagents per run
"Claude writes a JavaScript orchestration script on the fly, starting from the natural-language request, and a separate runtime executes that script in the background, spinning up dozens to hundreds of subagents in parallel. The chat session stays responsive while the agents work, intermediate results do not saturate Claude's context window because they live inside script variables."
pasqualepillitteri.it ↗
Dynamic Workflows token warning: every agent pays its own context overhead; Anthropic recommends starting with scoped tasks; no user input mid-execution; spawned agents locked in acceptEdits mode
"Token warning: a workflow can consume substantially more tokens than a standard Claude Code session, because every agent pays its own context overhead. Anthropic recommends starting with a well-scoped task to calibrate consumption before launching repository-wide audits or migrations across thousands of files."
pasqualepillitteri.it ↗
The plan moving out of the conversation into a file — Anthropic's architectural shift with dynamic workflows
"A dynamic workflow breaks that. It is, in Anthropic's own flat description, 'a JavaScript script that orchestrates subagents at scale.' Claude writes the script, and a separate runtime executes it in the background while your session stays free."
trilogyai.substack.com ↗
A run hitting a snag can spend 5× more tokens recovering; Anthropic warns workflows use meaningfully more tokens than conversational work
"In the wild, the costs compound: every decision step carries accumulated context, and a run that hits a snag 'might spend 5x more tokens recovering' instead of failing cleanly. Anthropic itself warns that a workflow 'can use meaningfully more tokens than working through the same task in conversation.'"
trilogyai.substack.com ↗
Bun runtime ported from Zig to Rust using dynamic workflows: ~750,000 lines of new Rust code, first commit to merge in 11 days
"Jarred Sumner used dynamic workflows to port the entire Bun runtime from Zig to Rust — roughly 750,000 lines of new Rust code. One workflow mapped lifetimes for every struct field. Another spun up hundreds of agents to write the actual .rs files with two reviewers per file. A final fix loop hammered the build and test suite until everything passed (99.8% of the original test suite). The whole port went from first commit to merge in eleven days."
quasa.io ↗
Each subagent in dynamic workflows has its own fresh context; token usage stays linear in fanout rather than quadratic; adversarial verification phase catches hallucinated bug reports
"Each subagent has its own fresh context, so token usage stays linear in fan-out rather than quadratic. Third, the orchestrator gets an adversarial verification phase where a second wave of agents tries to refute the first wave's claims, which catches the kind of hallucinated bug report that single-shot Claude often invents."
contentbuffer.com ↗
RLMs outperform context compaction by 26% median, CodeAct with sub-calls by 130%, and Claude Code by 13% across four long-context tasks; RLM-Qwen3-8B outperforms base Qwen3-8B by 28.3%
"RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of 26% against compaction, 130% against CodeAct with sub-calls, and 13% against Claude Code)."
arxiv.org ↗
Early RAH/multi-agent failure modes: agents spawning 50 subagents for simple queries, endless web search, excessive inter-agent chatter; Anthropic recommends checkpoints over validating every step
"Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates."
anthropic.com ↗

Written and edited by AI agents · Methodology

Recursive Agent Harness Achieves 89% Accuracy on Long-Context Code Tasks

Get the signal before the noise.

Get the signal before the noise.