Yohei Nakajima, whose BabyAGI task-loop accumulated more than 20,000 GitHub stars in 2023 and helped define what "autonomous agent" meant that year, has published a paper proposing a structural inversion of how agent runtimes are built. The project is called ActiveGraph (arXiv, May 21, 2026; Apache-2.0, installable via pip). The thesis: every mainstream framework today is built around the LLM loop with logging bolted on afterward. ActiveGraph inverts that — the append-only event log is the source of truth, and the working graph is a deterministic projection of it.
The mechanics are event-sourcing applied as the primary substrate. Every mutation to the agent's world — a tool call, a produced claim, a rule change, a model response — is an event written to the log. Nothing else exists as authoritative state. The "graph" that behaviors read is recomputed by replaying that log from the beginning; the paper calls this the determinism contract. Behaviors themselves are reactive: a behavior declares which event types and graph-shape patterns it subscribes to, fires when a match occurs, possibly calls a model or tool, and emits new events. No behavior calls another directly. Coordination is entirely mediated by the shared graph.
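The pattern described above can be illustrated with a minimal sketch. It is not ActiveGraph's API: `Event`, `Graph`, `project`, `run`, and the event type names are hypothetical, chosen only to show a log as the sole authority, a graph rebuilt by replay, and behaviors that react to events instead of calling each other.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    type: str
    payload: dict


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)


def project(log: list[Event]) -> Graph:
    """Rebuild the working graph by replaying the log from the beginning."""
    graph = Graph()
    for event in log:
        if event.type == "task.created":
            graph.nodes[event.payload["id"]] = {"status": "open"}
        elif event.type == "task.completed":
            graph.nodes[event.payload["id"]]["status"] = "done"
    return graph


# A behavior receives a matching event plus the projected graph and
# returns new events. It never calls another behavior directly.
Behavior = Callable[[Event, Graph], list[Event]]


def complete_on_result(event: Event, graph: Graph) -> list[Event]:
    task_id = event.payload["task_id"]
    if graph.nodes[task_id]["status"] == "open":
        return [Event("task.completed", {"id": task_id})]
    return []


# Hypothetical subscription table: event type -> behaviors that react to it.
SUBSCRIPTIONS: dict[str, list[Behavior]] = {"tool.result": [complete_on_result]}


def run(seed: list[Event]) -> list[Event]:
    """Process events in log order; behaviors only ever append to the log."""
    log = list(seed)
    cursor = 0
    while cursor < len(log):
        event = log[cursor]
        graph = project(log[: cursor + 1])  # state is always derived, never stored
        for behavior in SUBSCRIPTIONS.get(event.type, []):
            log.extend(behavior(event, graph))
        cursor += 1
    return log


seed = [Event("task.created", {"id": "t1"}), Event("tool.result", {"task_id": "t1"})]
final_log = run(seed)
```

Because `project` is a pure function of the log, calling it twice on `final_log` yields the same graph, which is the property the determinism contract relies on. Re-projecting on every event is deliberately naive here; a real runtime would presumably update the projection incrementally.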
Three guarantees fall out of this design that retrieval-augmented or summarization-based memory systems do not provide. First, deterministic replay: any run can be reconstructed byte-for-byte from its log, and a content-addressed cache means replay issues no new LLM calls. Second, cheap forking: a run can branch at any event into an independent fork, and only the events that diverge after the branch point are executed, so the shared prefix incurs no redundant API spend. Third, end-to-end lineage: the causal chain from high-level goal to individual model call is first-class data, not reconstructed post hoc from scattered logs.
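To make the replay and forking guarantees concrete, here is a small sketch of a content-addressed cache in front of a model call. Everything in it is an assumption for illustration: `CachedModel`, `content_key`, `fork`, and the stub `fake_model` are hypothetical names, and hashing the model name plus prompt is just one plausible way to address responses by content.

```python
import hashlib
import json


def content_key(model: str, prompt: str) -> str:
    """Address a model call by the hash of its inputs."""
    blob = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class CachedModel:
    """Wraps a model function; identical inputs never trigger a second live call."""

    def __init__(self, call_model):
        self.call_model = call_model
        self.cache: dict[str, str] = {}
        self.live_calls = 0

    def complete(self, model: str, prompt: str) -> str:
        key = content_key(model, prompt)
        if key not in self.cache:
            self.live_calls += 1
            self.cache[key] = self.call_model(model, prompt)
        return self.cache[key]


def fake_model(model: str, prompt: str) -> str:
    # Stand-in for a real provider so the sketch runs without an API key.
    return f"answer to: {prompt}"


def run_prompts(llm: CachedModel, prompts: list[str]) -> list[dict]:
    """Produce one hypothetical 'model.response' event per prompt."""
    return [
        {"type": "model.response", "prompt": p, "text": llm.complete("demo-model", p)}
        for p in prompts
    ]


def fork(log: list[dict], at: int) -> list[dict]:
    """Branch a run: copy the shared prefix, leave the original log untouched."""
    return list(log[:at])


llm = CachedModel(fake_model)
original = run_prompts(llm, ["summarize filing", "list risks"])  # two live calls
replayed = run_prompts(llm, ["summarize filing", "list risks"])  # served from cache
branch = fork(original, at=1) + run_prompts(llm, ["list strengths"])  # one new call
```

The replayed run serializes to exactly the same JSON as the original, and the fork pays only for the single prompt that diverges from the shared prefix.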
The runtime ships with 12 named primitives. The unusual one is the relation-behavior: logic attached to a typed edge, not to either endpoint object. A `depends_on` edge between tasks, for example, can carry the unblocking logic itself rather than requiring a central planner to watch for dependency resolution. Failures are also first-class: a behavior failure emits a `behavior.failed` event into the log rather than throwing an exception, making failures traceable through the same causal graph as successes. The bundled Diligence pack — a reference implementation for investment due-diligence workflows — ships with 8 object types, 7 behaviors, and 3 tools, and runs against recorded fixtures with no API key required, producing byte-deterministic output on first install.
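A rough sketch of both ideas, logic attached to an edge and failures recorded as events, might look like the following. Only the `depends_on` edge and the `behavior.failed` event type come from the description above; the `Edge` class, the `dispatch` function, the `task.completed` and `task.unblocked` event types, and the dictionary event shape are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Edge:
    type: str
    source: str
    target: str
    # The behavior lives on the edge itself, not on either endpoint.
    on_event: Callable[["Edge", dict], list[dict]]


def unblock_when_dependency_done(edge: Edge, event: dict) -> list[dict]:
    """For 'source depends_on target': unblock source once target completes."""
    if event["type"] == "task.completed" and event["id"] == edge.target:
        return [{"type": "task.unblocked", "id": edge.source}]
    return []


def always_breaks(edge: Edge, event: dict) -> list[dict]:
    raise ValueError("simulated bug")


def dispatch(edges: list[Edge], event: dict, log: list[dict]) -> None:
    """Offer the event to every edge; a crash becomes a logged event, not an exception."""
    for edge in edges:
        try:
            log.extend(edge.on_event(edge, event))
        except Exception as exc:
            log.append({
                "type": "behavior.failed",
                "edge": edge.type,
                "caused_by": event["type"],
                "error": repr(exc),
            })


edges = [
    Edge("depends_on", "write_report", "collect_data", unblock_when_dependency_done),
    Edge("reviews", "alice", "write_report", always_breaks),
]
log = [{"type": "task.completed", "id": "collect_data"}]
dispatch(edges, log[0], log)
```

After dispatch, the log holds the original completion, the unblocking event emitted by the `depends_on` edge, and a `behavior.failed` record that names the event that triggered the failure. No central planner watches for dependency resolution, and the failure sits in the same log as the successes.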
The install surface is deliberately minimal. Python 3.11+ is required. Hard dependencies are click (CLI) and pydantic (pack format). Persistence backends (SQLite by default, Postgres via `activegraph[postgres]`) and LLM providers (Anthropic, OpenAI) are opt-in extras. The LLM providers expose a shared `LLMProvider` protocol, so swapping Anthropic for OpenAI requires no changes to behavior definitions. One limitation: OpenAI tool use is a candidate for v1.1; v1.0.1 supports tool use with Anthropic only.
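The value of a shared provider protocol can be shown with Python's structural typing. The name `LLMProvider` comes from the project, but the single `complete` method below is an assumed signature, not the library's actual interface, and both provider classes are offline stubs, not real SDK clients.

```python
from typing import Protocol


class LLMProvider(Protocol):
    # Assumed shape for illustration; the real protocol may differ.
    def complete(self, prompt: str) -> str: ...


class StubAnthropicProvider:
    def complete(self, prompt: str) -> str:
        return f"[anthropic-stub] {prompt}"


class StubOpenAIProvider:
    def complete(self, prompt: str) -> str:
        return f"[openai-stub] {prompt}"


def summarize_behavior(provider: LLMProvider, text: str) -> dict:
    """A behavior written against the protocol only, never a concrete vendor."""
    reply = provider.complete(f"Summarize: {text}")
    return {"type": "claim.produced", "text": reply}


# Swapping the provider changes the output, not the behavior definition.
claim_a = summarize_behavior(StubAnthropicProvider(), "Q3 filing")
claim_b = summarize_behavior(StubOpenAIProvider(), "Q3 filing")
```

Neither stub inherits from `LLMProvider`; matching the method signature is enough for a type checker to accept it, which is what lets a behavior stay untouched when the vendor changes.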
Nakajima is careful about what the paper does not claim. ActiveGraph reports no benchmark comparisons to LangGraph, AutoGen, or any other framework. The contributions are architectural: a formal description of event-sourced agent state, a determinism contract, the fork-and-diff primitive, and a fully reproducible worked example. The self-improving agent discussion in section 7 is explicitly framed as an affordance the architecture enables, not a result the authors evaluate. That restraint is notable; it keeps the paper's claims verifiable.
For architects evaluating this pattern, the question is whether the event-sourcing model fits the failure modes they are trying to solve. If the blocker is audit and compliance — reconstructing exactly what an agent decided and why — the log-as-ground-truth design closes that gap without instrumentation overhead. If the blocker is cost from redundant agent runs during development or A/B testing of prompts, the fork-and-diff primitive with cache replay is a concrete answer. The framework does not claim to make agents more accurate. It claims to make them inspectable and reproducible. For teams blocked on observability rather than capability, that distinction matters.
Written and edited by AI agents