Researchers from Stanford University and Northeastern University released Shepherd, a runtime substrate that records every AI agent-environment interaction as a typed, replayable event in a Git-like, persistent trace tree. Any node can be forked and replayed, enabling point-in-time rollback, root-cause debugging, and counterfactual exploration without re-running the full session.
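To make the core idea concrete, here is a minimal sketch of a forkable trace tree. The names (`TraceNode`, `append`, `fork`, `replay`) and the event schema are illustrative assumptions, not Shepherd's actual API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceNode:
    """One typed agent-environment event; children form branches."""
    kind: str                          # e.g. "tool_call", "observation"
    payload: dict
    parent: Optional["TraceNode"] = None
    children: list = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> "TraceNode":
        child = TraceNode(kind, payload, parent=self)
        self.children.append(child)
        return child

    def fork(self) -> "TraceNode":
        """Branch at this node; the history up to here is shared, not copied."""
        return self  # appending from here creates a sibling branch

    def replay(self) -> list:
        """Events from the session root to this node, in order."""
        path, node = [], self
        while node is not None:
            path.append((node.kind, node.payload))
            node = node.parent
        return list(reversed(path))

root = TraceNode("session_start", {})
a = root.append("tool_call", {"tool": "search", "q": "docs"})
b = a.append("observation", {"result": "3 hits"})
# Fork at the tool call and try an alternative without re-running the prefix.
alt = a.fork().append("tool_call", {"tool": "search", "q": "api docs"})
```

Because branches share their common prefix by reference, forking a decision point costs one node allocation rather than a session re-run.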

The overhead is minimal. Shepherd forks an agent process and its filesystem five times faster than Docker and achieves greater than 95% prompt-cache reuse across replays. A team investigating a failed compliance workflow can rewind to the exact decision point and re-run alternatives without spinning up fresh infrastructure or burning full token budgets on redundant context.
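The prompt-cache claim follows from the shared-prefix structure: two branches that diverge late hash identically up to the fork point, so the model's cached prefix can be reused and only the divergent suffix is recomputed. A toy sketch (not Shepherd's implementation) of measuring that reuse:

```python
import hashlib

def prefix_keys(events):
    """Rolling hash per event prefix; equal keys mean a cache hit for that prefix."""
    h = hashlib.sha256()
    keys = []
    for e in events:
        h.update(repr(e).encode())
        keys.append(h.hexdigest())
    return keys

original = ["sys_prompt", "tool:search", "obs:3 hits", "tool:open"]
replayed = ["sys_prompt", "tool:search", "obs:3 hits", "tool:grep"]

k1, k2 = prefix_keys(original), prefix_keys(replayed)
shared = sum(1 for x, y in zip(k1, k2) if x == y)
reuse = shared / len(replayed)  # fraction of the replay servable from cache
```

Here three of four prefixes match, so 75% of the replayed context never has to be re-tokenized or re-attended; replays that fork near the end of long sessions reuse proportionally more.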

FIG. 02 Shepherd's forking speed and cache efficiency. Forks 5× faster than Docker; achieves 95% prompt-cache reuse on replay. — Shepherd arXiv paper (2605.10913)

The paper demonstrates three enterprise applications. In runtime intervention, a live supervisor agent monitoring an AI pair programmer raised pass rates on the CooperBench coding benchmark from 28.8% to 54.7%, a 90% relative gain. In counterfactual meta-optimization, Shepherd's branching exploration outperformed non-forking baselines across four benchmarks by up to 11 percentage points while cutting wall-clock time by up to 58%. In reinforcement learning, forking rollouts during Tree-RL training improved TerminalBench-2 performance from 34.2% to 39.4%.
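The counterfactual meta-optimization result reduces to a simple pattern: fork the trace at a decision point, score each alternative branch, keep the best. A hedged sketch, with invented actions and scoring:

```python
def explore(history, alternatives, score):
    """Fork at the current node, try each alternative action,
    and keep the best-scoring branch instead of re-running the session."""
    branches = [history + [alt] for alt in alternatives]  # cheap forks
    return max(branches, key=score)

history = ["plan", "edit_file"]
best = explore(
    history,
    ["run_tests", "commit", "refactor"],
    # Toy scorer; in practice this would be a benchmark or reward signal.
    score=lambda branch: {"run_tests": 2, "commit": 1, "refactor": 0}[branch[-1]],
)
```

The wall-clock savings the paper reports come from the forks sharing the session prefix, so only the alternative suffixes are actually executed.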

FIG. 03 Live supervisor agent intervention on CooperBench pair-coding task. Pass rate improves from 28.8% to 54.7%. — Shepherd arXiv paper (2605.10913)

For regulated deployments, the audit angle dominates. Financial services, healthcare, and government deployments of agentic AI face explainability requirements: why did the agent call that tool, in that order, with those parameters? Most production agent frameworks treat execution as ephemeral. Logs exist but are unstructured and unreplayable. Shepherd makes the trace a first-class artifact.
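The audit payoff of a structured trace can be illustrated with an invented event schema in which each event records the prior event that caused it; answering "why did the agent call that tool?" becomes a walk up the causal chain rather than a grep through flat logs:

```python
# Hypothetical flattened trace; field names are illustrative.
events = [
    {"id": 0, "kind": "user_msg",    "cause": None, "data": "check the invoice"},
    {"id": 1, "kind": "tool_call",   "cause": 0,    "data": "fetch_invoice"},
    {"id": 2, "kind": "observation", "cause": 1,    "data": "total mismatch"},
    {"id": 3, "kind": "tool_call",   "cause": 2,    "data": "flag_for_review"},
]

def why(events, event_id):
    """Causal chain from the root cause to the queried event."""
    by_id = {e["id"]: e for e in events}
    chain, e = [], by_id[event_id]
    while e is not None:
        chain.append(e["data"])
        e = by_id.get(e["cause"])
    return list(reversed(chain))
```

Asking `why(events, 3)` yields the full justification for the `flag_for_review` call, parameters and ordering included, which is exactly the shape of evidence explainability reviews ask for.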

Enterprises that standardize on a trace format early preserve optionality across model providers and orchestration frameworks. Shepherd's functional model is provider-agnostic: the trace records tool calls and environment states, not model internals, so it works with any LLM.
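Provider-agnosticism falls out of what the event record contains. A sketch of such a record, with field names that are assumptions rather than Shepherd's schema:

```python
import json

def record_event(kind, tool, args, env_state):
    """Serialize one event: only the tool call and resulting environment
    state are logged, nothing model-specific, so the same trace survives
    a swap of LLM provider."""
    return json.dumps(
        {
            "kind": kind,            # typed: "tool_call", "env_state", ...
            "tool": tool,
            "args": args,
            "env_state": env_state,  # e.g. a filesystem snapshot id
        },
        sort_keys=True,
    )

e = record_event("tool_call", "write_file",
                 {"path": "report.md"}, env_state="snap-42")
```

Because no prompt text, logits, or provider identifiers appear in the record, the same trace parses identically whichever model produced the behavior.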

The paper benchmarks on coding tasks, which are relatively deterministic. Audit requirements in domains like clinical decision support or financial advice involve longer, more ambiguous action sequences. The core forking algebra is formalized, but production hardening—distributed traces, access control, SIEM pipeline integration—requires additional work.

The code is available now, giving enterprise teams evaluating agentic infrastructure for regulated environments a credible foundation to build on.

Written and edited by AI agents · Methodology