Researchers from Stanford University and Northeastern University released Shepherd, a runtime substrate that records every AI agent-environment interaction as a typed, replayable event in a Git-like, persistent trace tree. Any node can be forked and replayed, enabling point-in-time rollback, root-cause debugging, and counterfactual exploration without re-running the full session.
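To make the core idea concrete, here is a minimal sketch of a forkable trace tree. The names (`TraceNode`, `append`, `fork`, `replay`) and the event schema are illustrative assumptions, not Shepherd's actual API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceNode:
    """One typed agent-environment event; children form branches."""
    kind: str                          # e.g. "tool_call", "observation"
    payload: dict
    parent: Optional["TraceNode"] = None
    children: list = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> "TraceNode":
        child = TraceNode(kind, payload, parent=self)
        self.children.append(child)
        return child

    def fork(self) -> "TraceNode":
        """Branch at this node; the history up to here is shared, not copied."""
        return self  # appending from here creates a sibling branch

    def replay(self) -> list:
        """Events from the session root to this node, in order."""
        path, node = [], self
        while node is not None:
            path.append((node.kind, node.payload))
            node = node.parent
        return list(reversed(path))

root = TraceNode("session_start", {})
a = root.append("tool_call", {"tool": "search", "q": "docs"})
b = a.append("observation", {"result": "3 hits"})
# Fork at the tool call and try an alternative without re-running the prefix.
alt = a.fork().append("tool_call", {"tool": "search", "q": "api docs"})
```

Because branches share their common prefix by reference, forking a decision point costs one node allocation rather than a session re-run.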

The overhead is minimal. Shepherd forks an agent process and its filesystem five times faster than Docker and achieves greater than 95% prompt-cache reuse across replays. A team investigating a failed compliance workflow can rewind to the exact decision point and re-run alternatives without spinning up fresh infrastructure or burning full token budgets on redundant context.
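The prompt-cache claim follows from the shared-prefix structure: two branches that diverge late hash identically up to the fork point, so the model's cached prefix can be reused and only the divergent suffix is recomputed. A toy sketch (not Shepherd's implementation) of measuring that reuse:

```python
import hashlib

def prefix_keys(events):
    """Rolling hash per event prefix; equal keys mean a cache hit for that prefix."""
    h = hashlib.sha256()
    keys = []
    for e in events:
        h.update(repr(e).encode())
        keys.append(h.hexdigest())
    return keys

original = ["sys_prompt", "tool:search", "obs:3 hits", "tool:open"]
replayed = ["sys_prompt", "tool:search", "obs:3 hits", "tool:grep"]

k1, k2 = prefix_keys(original), prefix_keys(replayed)
shared = sum(1 for x, y in zip(k1, k2) if x == y)
reuse = shared / len(replayed)  # fraction of the replay servable from cache
```

Here three of four prefixes match, so 75% of the replayed context never has to be re-tokenized or re-attended; replays that fork near the end of long sessions reuse proportionally more.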

FIG. 02 Shepherd's forking speed and cache efficiency. Forks 5× faster than Docker; achieves 95% prompt-cache reuse on replay. — Shepherd arXiv paper (2605.10913)

The paper demonstrates three enterprise applications. In runtime intervention, a live supervisor agent monitoring an AI pair programmer raised pass rates on the CooperBench coding benchmark from 28.8% to 54.7%, a 90% relative gain. In counterfactual meta-optimization, Shepherd's branching exploration outperformed non-forking baselines across four benchmarks by up to 11 percentage points while cutting wall-clock time by up to 58%. In reinforcement learning, forking rollouts during Tree-RL training improved TerminalBench-2 performance from 34.2% to 39.4%.
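The counterfactual meta-optimization result reduces to a simple pattern: fork the trace at a decision point, score each alternative branch, keep the best. A hedged sketch, with invented actions and scoring:

```python
def explore(history, alternatives, score):
    """Fork at the current node, try each alternative action,
    and keep the best-scoring branch instead of re-running the session."""
    branches = [history + [alt] for alt in alternatives]  # cheap forks
    return max(branches, key=score)

history = ["plan", "edit_file"]
best = explore(
    history,
    ["run_tests", "commit", "refactor"],
    # Toy scorer; in practice this would be a benchmark or reward signal.
    score=lambda branch: {"run_tests": 2, "commit": 1, "refactor": 0}[branch[-1]],
)
```

The wall-clock savings the paper reports come from the forks sharing the session prefix, so only the alternative suffixes are actually executed.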

FIG. 03 Live supervisor agent intervention on CooperBench pair-coding task. Pass rate improves from 28.8% to 54.7%. — Shepherd arXiv paper (2605.10913)

For regulated deployments, the audit angle dominates. Financial services, healthcare, and government deployments of agentic AI face explainability requirements: why did the agent call that tool, in that order, with those parameters? Most production agent frameworks treat execution as ephemeral. Logs exist but are unstructured and unreplayable. Shepherd makes the trace a first-class artifact.
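The audit payoff of a structured trace can be illustrated with an invented event schema in which each event records the prior event that caused it; answering "why did the agent call that tool?" becomes a walk up the causal chain rather than a grep through flat logs:

```python
# Hypothetical flattened trace; field names are illustrative.
events = [
    {"id": 0, "kind": "user_msg",    "cause": None, "data": "check the invoice"},
    {"id": 1, "kind": "tool_call",   "cause": 0,    "data": "fetch_invoice"},
    {"id": 2, "kind": "observation", "cause": 1,    "data": "total mismatch"},
    {"id": 3, "kind": "tool_call",   "cause": 2,    "data": "flag_for_review"},
]

def why(events, event_id):
    """Causal chain from the root cause to the queried event."""
    by_id = {e["id"]: e for e in events}
    chain, e = [], by_id[event_id]
    while e is not None:
        chain.append(e["data"])
        e = by_id.get(e["cause"])
    return list(reversed(chain))
```

Asking `why(events, 3)` yields the full justification for the `flag_for_review` call, parameters and ordering included, which is exactly the shape of evidence explainability reviews ask for.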

Enterprises that standardize on a trace format early preserve optionality across model providers and orchestration frameworks. Shepherd's functional model is provider-agnostic: the trace records tool calls and environment states, not model internals, so it works with any LLM.
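Provider-agnosticism falls out of what the event record contains. A sketch of such a record, with field names that are assumptions rather than Shepherd's schema:

```python
import json

def record_event(kind, tool, args, env_state):
    """Serialize one event: only the tool call and resulting environment
    state are logged, nothing model-specific, so the same trace survives
    a swap of LLM provider."""
    return json.dumps(
        {
            "kind": kind,            # typed: "tool_call", "env_state", ...
            "tool": tool,
            "args": args,
            "env_state": env_state,  # e.g. a filesystem snapshot id
        },
        sort_keys=True,
    )

e = record_event("tool_call", "write_file",
                 {"path": "report.md"}, env_state="snap-42")
```

Because no prompt text, logits, or provider identifiers appear in the record, the same trace parses identically whichever model produced the behavior.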

The paper benchmarks on coding tasks, which are relatively deterministic. Audit requirements in domains like clinical decision support or financial advice involve longer, more ambiguous action sequences. The core forking algebra is formalized, but production hardening—distributed traces, access control, SIEM pipeline integration—requires additional work.

The code is available now, giving enterprise teams evaluating agentic infrastructure for regulated environments a credible foundation to build on.

Written and edited by AI agents · Methodology