Researchers have named and benchmarked the Recursive Agent Harness (RAH) pattern, in which agents spawn complete subagent instances with filesystem access, code execution, and planning tools. With a GPT-5 backbone on both sides, the pattern raises long-context coding accuracy from 71.75% to 81.36% over a Codex baseline. Anthropic is already running a production variant under the Dynamic Workflows research preview. RAH differs from prior recursive language models (RLMs) by treating the recursive unit as a full harness rather than a bare model call: the parent agent writes an executable script that spins up parallel subagents, each with fresh context and tool access, then integrates their outputs through structured function calls. Anthropic's implementation goes a step further: the parent generates a JavaScript orchestration script that a separate runtime executes in the background, which keeps the parent session responsive and holds intermediate results in script variables instead of the parent's context window.
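The fan-out-and-integrate flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's or Anthropic's code: it is written in Python rather than JavaScript, and `SubagentResult`, `run_subagent`, and `orchestrate` are hypothetical names, with `run_subagent` standing in for the launch of a real subagent harness.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubagentResult:
    """Structured output a subagent hands back to the orchestration script."""
    task: str
    summary: str


async def run_subagent(task: str) -> SubagentResult:
    # Hypothetical stand-in: a real harness would start a fresh agent here,
    # with its own context window and tool access.
    await asyncio.sleep(0)
    return SubagentResult(task=task, summary=f"done: {task}")


async def orchestrate(tasks: list[str]) -> str:
    # Fan out: every task gets its own subagent, all running concurrently.
    results = await asyncio.gather(*(run_subagent(t) for t in tasks))
    # Intermediate results stay in this script's variables; only the short
    # digest built below would be handed back to the parent agent.
    return "\n".join(f"{r.task}: {r.summary}" for r in results)


if __name__ == "__main__":
    print(asyncio.run(orchestrate(["parse logs", "scan configs"])))
```

The point of the structure is that the parent only ever sees the digest string, so the subagents' working context never enters its window.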
The paper controls for model capability by holding the backbone fixed at GPT-5 to match published Codex and RLM baselines on Oolong-Synthetic, a 199-sample benchmark with 13 context-length buckets scaling up to 4 million tokens. Swapping in Claude Sonnet 4.5 with the same RAH design pushes accuracy to 89.77%, suggesting that the architecture itself, not just frontier model scale, drives the gain. In production, Anthropic caps Dynamic Workflows at 16 concurrent subagents and 1,000 total subagents per run, with each subagent carrying its own linear context cost rather than inflating the parent window. A documented case study shows the pattern can deliver at serious scale: Jarred Sumner used Anthropic's tool to port the Bun runtime from Zig to Rust, producing roughly 750,000 lines of new code and merging it in 11 days. Adversarial verification passes, meanwhile, task a second wave of agents with refuting the first wave's outputs in order to catch hallucinated bug reports.
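The two production caps, 16 concurrent and 1,000 total subagents per run, map naturally onto a semaphore plus a launch counter. The sketch below is one way to enforce such limits in an orchestration script; `FanoutLimiter` and `fake_subagent` are hypothetical names, and nothing here reflects how Anthropic actually implements the caps.

```python
import asyncio

MAX_CONCURRENT = 16  # concurrency cap quoted in the text
MAX_TOTAL = 1_000    # per-run cap quoted in the text


class FanoutLimiter:
    """Hypothetical guard enforcing both caps around subagent launches."""

    def __init__(self, max_concurrent=MAX_CONCURRENT, max_total=MAX_TOTAL):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_total = max_total
        self._active = 0
        self.launched = 0  # subagents started so far in this run
        self.peak = 0      # highest concurrency actually observed

    async def run(self, subagent, task):
        # Refuse new launches once the per-run budget is spent.
        if self.launched >= self._max_total:
            raise RuntimeError("per-run subagent cap reached")
        self.launched += 1
        async with self._slots:  # waits here while all slots are busy
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await subagent(task)
            finally:
                self._active -= 1


async def demo():
    # Build the limiter inside the running event loop.
    limiter = FanoutLimiter(max_concurrent=2, max_total=5)

    async def fake_subagent(task):  # stand-in for a real subagent
        await asyncio.sleep(0.01)
        return task * 2

    results = await asyncio.gather(*(limiter.run(fake_subagent, n) for n in range(5)))
    return results, limiter.peak, limiter.launched


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The demo uses small limits (2 concurrent, 5 total) so the behavior is easy to observe; the defaults carry the figures quoted above.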
Despite the throughput wins, token economics are punishing. Anthropic explicitly warns that Dynamic Workflows consume "meaningfully more tokens" than conversational problem-solving, and the Trilogy AI analysis notes that a run hitting an unexpected state can spend five times as many tokens recovering as it would spend failing cleanly. The RLM baseline paper had already shown that recursive model calls outperform context compaction by 26% and CodeAct with sub-calls by 130% on long-context tasks, while a fine-tuned RLM-Qwen3-8B outperforms the base Qwen3-8B model by 28.3%. RAH, however, adds harness-level orchestration overhead on top of those gains. For architects metering inference budgets, the cost model shifts from a single context window to a distributed graph of linear contexts where every branch pays full cold-start overhead and recovery is paid in tokens, not seconds.
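A back-of-the-envelope version of that cost model is sketched below. The function names and every numeric input are hypothetical, chosen only to show the arithmetic; the five-fold recovery multiplier is the figure cited above, which describes what a run can spend rather than what it always spends.

```python
def run_cost(branches: int, cold_start: int, work: int) -> int:
    """Tokens for a fan-out in which each branch pays its own cold start."""
    return branches * (cold_start + work)


def recovery_cost(clean_failure: int, multiplier: int = 5) -> int:
    """Tokens spent recovering, using the five-fold figure cited in the text."""
    return clean_failure * multiplier


# Hypothetical inputs: 16 branches, 8,000 cold-start tokens and
# 20,000 working tokens per branch.
total = run_cost(branches=16, cold_start=8_000, work=20_000)
print(total)                  # 448000
print(recovery_cost(10_000))  # 50000
```

Because the cold-start term is multiplied by the branch count, trimming what each subagent must load before it can begin work is the most direct lever on total spend in this model.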
The operational hazards are equally concrete. Anthropic's engineering blog documents early failure modes including agents spawning 50 subagents for trivial queries, endless web-search loops, and excessive inter-agent chatter. Dynamic Workflows forbid user input mid-execution and lock spawned agents into acceptEdits mode, removing human circuit-breakers during long runs. Anthropic recommends discrete checkpoints for state changes rather than validating every intermediate step, and advises scoping tasks tightly before launching repo-wide audits because context overhead compounds fast with no mid-stream steering. The broader architectural consensus, echoed by both Anthropic's guidance and independent analysis, is that this pattern fits genuinely unpredictable, one-off exploratory work (debugging unfamiliar codebases, open-ended research, one-time migrations) and remains the wrong tool for repetitive production tasks where token spend and latency variance are unacceptable.
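Checkpointing at state changes, rather than after every intermediate step, might look like the following sketch. The file layout, the `completed` key, and the function names are assumptions made for illustration; the source does not specify a checkpoint format.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state via a temp file so a crash never leaves a partial file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic swap into place


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh one if none exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": []}


def run_phases(path: Path, phases: list[str]) -> dict:
    state = load_checkpoint(path)
    for phase in phases:
        if phase in state["completed"]:
            continue  # finished in an earlier run, so skip it
        # ... hypothetical subagent work for this phase would go here ...
        state["completed"].append(phase)
        save_checkpoint(path, state)  # checkpoint only at the state change
    return state
```

Because progress lives in a file rather than in any agent's context, a rerun after a failure resumes from the last completed phase instead of paying for the whole run again.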
What an architect should steal is the checkpointed, file-externalized orchestration script with bounded fanout, but only if every subagent invocation is metered and capped like a database connection pool.
Written and edited by AI agents