EvoArena Benchmark Exposes Agent Collapse in Evolving Environments

Current agents average 39.6% accuracy across evolving terminal, software, and social-preference environments, according to EvoArena, a benchmark introduced by a NUS-led group that includes Salesforce AI Research and MIT. This is the first suite to model real-world drift as progressive versioned updates rather than static snapshots. It comprises three sub-benchmarks: Terminal-Bench-Evo for shifting CLI workflows, SWE-Chain-Evo for evolving codebases, and PersonaMem-Evo for drifting user preferences. Unlike WebArena, SWE-Bench, or GAIA, EvoArena sequences consecutive changes and tests whether an agent can complete a chain of related evolutionary subtasks without dropping constraints that remain valid from earlier versions.
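To make the chain requirement concrete, here is a minimal sketch of the difference between per-subtask accuracy and chain-level accuracy, where a chain counts as solved only if every consecutive subtask in it is solved. The function names and the result data are hypothetical illustrations, not the benchmark's actual scoring code.

```python
from typing import Dict, List


def subtask_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of individual subtasks solved, pooled over all chains."""
    results = [ok for steps in chains.values() for ok in steps]
    return sum(results) / len(results)


def chain_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of chains in which every consecutive subtask was solved."""
    return sum(all(steps) for steps in chains.values()) / len(chains)


# Hypothetical results: True means the agent solved that version's subtask.
results = {
    "cli-workflow": [True, True, False],
    "codebase": [True, True, True],
    "user-prefs": [True, False, True],
}

print(f"subtask accuracy: {subtask_accuracy(results):.3f}")
print(f"chain accuracy:   {chain_accuracy(results):.3f}")
```

A single miss anywhere in a chain fails the whole chain, which is why chain-level scores sit well below per-subtask scores in this toy data.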

The primary failure mode is "state collapse," where standard agents built on memory-bank or episodic-store architectures maintain a single latest state. When a workflow permission or API schema is updated, the new rule overwrites the old one, causing the agent to lose both the previous behavior and the contextual boundary of when it applied. EvoArena finds that this collapse is the norm across all three domains, and that version-compatibility checks are particularly lethal to baseline systems.
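The overwrite behavior can be shown with a toy example. This is a sketch under assumptions: `LatestStateMemory`, the `deploy_command` key, and the command strings are invented for illustration and do not come from the paper or any specific memory framework.

```python
from typing import Dict


class LatestStateMemory:
    """Toy memory that keeps one value per key; every update overwrites."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def update(self, key: str, value: str, version: int) -> None:
        # The version is accepted but discarded: only the newest value survives.
        self._state[key] = value

    def lookup(self, key: str, version: int) -> str:
        # The requested version cannot influence the answer.
        return self._state[key]


memory = LatestStateMemory()
memory.update("deploy_command", "deploy --env prod", version=1)
memory.update("deploy_command", "deploy --target prod --confirm", version=2)

# A task pinned to version 1 still receives the version 2 rule.
answer_for_v1 = memory.lookup("deploy_command", version=1)
print(answer_for_v1)
```

Both the old rule and the information about when it applied are gone after the second update, which is the loss the article calls state collapse.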

To counter collapse, the authors propose EvoMem, a lightweight add-on that attaches an append-only patch log to an existing memory system rather than rewriting it. Each environmental change is stored as a structured diff, allowing the agent to reconstruct any prior state by replaying the sequence. On EvoArena, EvoMem improves average accuracy by 1.5 percentage points over baseline memory systems, with larger gains of 6.1% on GAIA and 4.8% on LoCoMo. Chain-level accuracy on EvoArena improves by 3.7% with EvoMem.

FIG. 02 EvoMem accuracy improvements across three benchmarks. Gains vary by task domain, with GAIA showing the largest boost. — EvoArena paper (arxiv.org/abs/2606.13681v1)
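The append-only patch log idea can be sketched in a few lines. This is not the authors' implementation: the `Patch` and `PatchLog` names, the field layout, and the example entries are assumptions, loosely modeled on the paper's description of a diff as what changed, what was replaced, and what context triggered it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Patch:
    """One structured diff (hypothetical field names)."""
    version: int              # environment version that introduced the change
    key: str                  # what changed
    new_value: Optional[str]  # value after the change; None means removed
    old_value: Optional[str]  # what was replaced
    trigger: str              # context that triggered the change


class PatchLog:
    """Append-only log; entries are never edited or deleted."""

    def __init__(self) -> None:
        self._patches: List[Patch] = []

    def append(self, patch: Patch) -> None:
        if self._patches and patch.version < self._patches[-1].version:
            raise ValueError("patches must be appended in version order")
        self._patches.append(patch)

    def state_at(self, version: int) -> Dict[str, str]:
        """Rebuild the state as of `version` by replaying patches in order."""
        state: Dict[str, str] = {}
        for patch in self._patches:
            if patch.version > version:
                break
            if patch.new_value is None:
                state.pop(patch.key, None)
            else:
                state[patch.key] = patch.new_value
        return state


log = PatchLog()
log.append(Patch(1, "deploy_command", "deploy --env prod", None, "initial setup"))
log.append(Patch(2, "deploy_command", "deploy --target prod --confirm",
                 "deploy --env prod", "CLI update renamed the flag"))

print(log.state_at(1))
print(log.state_at(2))
```

Because nothing is overwritten, a query pinned to version 1 and a query pinned to version 2 each get the rule that was in force at the time.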

However, the paper does not report serving cost, wall-clock latency for patch replay, token spend per reconstruction, or GPU-hours. The trade-off is retrieval complexity: reconstructing state by replaying a log of diffs is algorithmically more expensive than reading a single snapshot, and the cost grows with the depth of the log. Code and datasets are available on GitHub and HuggingFace, but there is no production evidence yet. Architects evaluating EvoMem for long-lived agents would need to see latency distributions at patch-log depths of thousands of versions, the token-cost multiplier versus flat memory retrieval, and whether the approach holds up under rate limits and context-window caps.
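One way a team could start gathering the missing latency numbers is a small replay timing harness. The sketch below is an assumption-laden stand-in: it times only in-memory replay of synthetic key-value patches with `time.perf_counter`, and says nothing about token spend, LLM latency, or EvoMem's real storage format. All function names are hypothetical.

```python
import time
from typing import Dict, List, Tuple

PatchT = Tuple[str, str]  # (key, new_value); a simplified stand-in for a diff


def build_log(depth: int, n_keys: int = 50) -> List[PatchT]:
    """Synthetic log of `depth` patches cycling over `n_keys` keys."""
    return [(f"key{i % n_keys}", f"value{i}") for i in range(depth)]


def replay(log: List[PatchT]) -> Dict[str, str]:
    """Reconstruct the final state by applying every patch in order."""
    state: Dict[str, str] = {}
    for key, value in log:
        state[key] = value
    return state


def time_replay(depth: int, repeats: int = 5) -> float:
    """Best-of-`repeats` wall-clock seconds to replay a log of `depth` patches."""
    log = build_log(depth)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        replay(log)
        best = min(best, time.perf_counter() - start)
    return best


for depth in (10, 1_000, 100_000):
    print(f"depth={depth:>7}: {time_replay(depth) * 1e3:.3f} ms")
```

Replay work here is linear in log depth, whereas a snapshot read is a single lookup. A production evaluation would also need to measure the tokens consumed when diff histories are placed in the model's context.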

A 1.5-point lift on a 39.6% baseline brings average accuracy to roughly 41.1%, so failure remains the modal outcome, and chain-level survival is still largely out of reach for current models. Capturing history is not the same as reasoning correctly about it under compute budgets. For teams running CI bots, persistent coding assistants, or personalized concierge systems, the implied integration cost is nontrivial: someone must define structured patch schemas, wire them into the memory store, and prompt the LLM to read diff histories instead of simple key-value context. The open question is whether these gains hold once patch logs grow beyond the few-dozen-step horizons typical of academic benchmarks, and whether the memory overhead of storing every diff eventually forces a compression or summarization step that reintroduces the state-loss risk EvoMem is meant to solve.
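The integration work described above might look roughly like the following sketch, which defines a patch schema and renders the relevant history as prompt text. The schema fields follow the paper's description of a diff (what changed, what was replaced, what context triggered it), but the `Patch` class, `render_history` function, prompt wording, and example entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Patch:
    """Hypothetical patch schema for illustration only."""
    version: int
    changed: str             # what changed
    replaced: Optional[str]  # what was replaced, if anything
    trigger: str             # context that triggered the change


def render_history(patches: List[Patch], target_version: int) -> str:
    """Format all patches up to `target_version` as plain prompt text."""
    lines = [f"Environment history up to version {target_version}:"]
    for p in patches:
        if p.version > target_version:
            continue  # later changes must not leak into an earlier version
        old = f" (replaced: {p.replaced})" if p.replaced else ""
        lines.append(f"- v{p.version}: {p.changed}{old}; trigger: {p.trigger}")
    lines.append(f"Answer using the rules in force at version {target_version}.")
    return "\n".join(lines)


history = [
    Patch(1, "deploy uses --env", None, "initial setup"),
    Patch(2, "deploy uses --target", "deploy uses --env", "CLI update"),
]

prompt_v1 = render_history(history, target_version=1)
print(prompt_v1)
```

Every patch included this way adds tokens to the prompt, which is where the context-window and cost concerns raised earlier begin to bite as logs deepen.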

Sources

Current agents average 39.6% accuracy on EvoArena across evolving terminal, software, and social-preference domains
"Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains."
arxiv.org ↗
EvoArena covers three sub-benchmarks: Terminal-Bench-Evo, SWE-Chain-Evo, and PersonaMem-Evo
"EvoArena includes Terminal-Bench-Evo for evolving terminal workflows, SWE-Chain-Evo for evolving codebases, and PersonaMem-Evo for evolving user preferences."
arxiv.org ↗
EvoMem yields a 1.5% average accuracy gain on EvoArena, +6.1% on GAIA, and +4.8% on LoCoMo
"EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%."
arxiv.org ↗
EvoMem improves chain-level accuracy by 3.7% on EvoArena
"EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks."
arxiv.org ↗
State collapse is the core failure mode: standard agents maintain a single latest memory state, causing prior valid knowledge to be silently overwritten
"most memory-based agents maintain memory as a single latest state... This design is effective when newer information safely supersedes older information, but becomes brittle when different environment versions require different behaviors."
arxiv.org ↗
EvoMem is a lightweight git-like patch log add-on: each change is stored as a structured diff enabling prior-state reconstruction
"EvoMem augments a standard memory system with an append-only patch log — each environment change is stored as a structured diff (what changed, what was replaced, what context triggered it)"
arxiv.org ↗
Code is available on GitHub and dataset on HuggingFace
"https://github.com/Aiden0526/EvoArena"
github.com ↗

Written and edited by AI agents · Methodology
