MEME benchmark finds 97% failure on agent memory dependency tasks

A new benchmark published on arXiv on May 12, 2026, finds that all six LLM agent memory systems it tested collapse on dependency reasoning tasks under default configurations. Cascade accuracy averaged 3% and Absence accuracy averaged 1% across the six systems.

The paper, MEME (Multi-Entity & Evolving Memory Evaluation), comes from researchers including Seokwon Jung, Alexander Rubinstein, and Seong Joon Oh. It defines six memory tasks along two axes: multi-entity (tracking multiple simultaneous records) and evolving (handling updates over time). Three of the tasks (Cascade, Absence, and Deletion) are not scored by any prior benchmark. Cascade tests whether an agent propagates a change through dependent entities. Absence tests whether an agent treats the lack of an update as a meaningful signal. Deletion tests whether an agent stops referencing records after they are removed. The team evaluated six memory systems on 100 controlled episodes. The systems span three paradigms: raw retrieval (BM25, text-embedding-3-small), LLM-processed memory (Mem0, Graphiti), and file-based agents (MD-flat, Karpathy Wiki).
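To make the three new tasks concrete, here is a minimal Python sketch of what each one asks of a memory store. It is a toy illustration, not the benchmark's harness: the `ToyMemory` class, its method names, and the customer fields are all hypothetical, and the Absence check reflects only the article's one-line description of that task.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical in-memory store; not part of MEME or any tested system."""
    values: dict = field(default_factory=dict)
    last_write: dict = field(default_factory=dict)  # key -> clock tick of last write
    rules: dict = field(default_factory=dict)       # child -> (parent, derive function)
    clock: int = 0

    def link(self, child, parent, derive):
        # Declare that `child` is derived from `parent` (assumed acyclic).
        self.rules[child] = (parent, derive)

    def write(self, key, value):
        self.clock += 1
        self._store(key, value)

    def _store(self, key, value):
        self.values[key] = value
        self.last_write[key] = self.clock
        # Cascade: recompute every record derived from the one that changed.
        for child, (parent, derive) in self.rules.items():
            if parent == key:
                self._store(child, derive(value))

    def delete(self, key):
        # Deletion: a removed record must no longer be answerable.
        self.clock += 1
        self.values.pop(key, None)
        self.last_write.pop(key, None)

    def read(self, key):
        return self.values.get(key)

    def updated_since(self, key, tick):
        # Absence: "no write after `tick`" is itself a fact the caller can use.
        return self.last_write.get(key, -1) > tick


mem = ToyMemory()
mem.link("tax_region", "billing_city", lambda city: {"Berlin": "DE", "Paris": "FR"}[city])
mem.write("billing_city", "Berlin")
mem.write("phone", "555-0100")
checkpoint = mem.clock

mem.write("billing_city", "Paris")
cascade_ok = mem.read("tax_region") == "FR"               # change reached the dependent record
absence_ok = not mem.updated_since("phone", checkpoint)   # silence means the phone is unchanged
mem.delete("phone")
deletion_ok = mem.read("phone") is None                   # removed record is no longer returned
```

A hand-written store passes all three checks trivially because the dependencies are declared up front. The benchmark's point is that agent memory systems have to infer the same behavior from conversation history.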

On static retrieval tasks, several systems performed adequately. MD-flat and text-embedding-3-small both passed entity recognition and tracking. On dependency reasoning, performance collapsed. No system passed Absence under the default configuration. Graphiti, a graph-based memory system theoretically suited to multi-entity tracking, scored 0.03 overall. The researchers tested whether standard remedies could close the gap. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs did not.

FIG. 02 MEME task performance: default LLM agents fail on dependency reasoning (Cascade, Absence, Deletion); MD-flat + Opus 4.7 partially closes the gap. — arXiv 2605.12477 · seokwonjung-jay.github.io/meme-eval

These failure modes map directly to enterprise risk. Cascade failure means an agent updating a customer's account record does not propagate that change to dependent records: billing address diverges from shipping address, tax jurisdiction goes stale. Absence failure means an agent cannot reason that a missing update carries informational weight; it treats silence as neutral rather than as signal. Deletion failure means agent memory may still surface a record after a GDPR erasure request, exposing companies to regulatory liability. Any team operating persistent agents in compliance, customer service, or workflow automation now has a taxonomy for where their memory stack is blind.

One configuration partially closed the dependency reasoning gap: MD-flat paired with Claude Opus 4.7 as its internal LLM. That pairing, which the project page lists as evaluated on 20 episodes, achieved an overall score of 0.55, with Cascade at 0.32 and Absence at 0.59. Cost was $3.87 per episode for ingest plus $0.66 for inference, approximately 70x the cost of the baseline retrieval systems. For most production deployments running thousands of sessions daily, that multiple makes the fix impractical at scale.
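A back-of-the-envelope calculation shows why. The sketch below uses only the two per-episode figures and the ~70x multiple reported above. Everything else is an assumption for illustration: that one production session costs about as much as one benchmark episode, that the 70x multiple applies to the combined ingest-plus-inference cost, and the 5,000-sessions-per-day volume.

```python
# Reported per-episode costs for MD-flat + Opus 4.7, in integer cents
# (integers avoid float rounding when summing currency).
INGEST_CENTS = 387
INFERENCE_CENTS = 66
REPORTED_MULTIPLE = 70  # "~70x the baseline cost"

TOTAL_CENTS = INGEST_CENTS + INFERENCE_CENTS


def daily_cost_usd(sessions_per_day: int) -> float:
    """Daily spend, ASSUMING one session costs the same as one benchmark episode."""
    return sessions_per_day * TOTAL_CENTS / 100


# Baseline cost implied by the multiple, ASSUMING it applies to the combined cost.
implied_baseline_cents = TOTAL_CENTS / REPORTED_MULTIPLE

hypothetical_sessions = 5_000  # illustrative volume, not a figure from the paper
print(f"per episode: ${TOTAL_CENTS / 100:.2f}")
print(f"implied baseline: about {implied_baseline_cents:.1f} cents per episode")
print(f"daily at {hypothetical_sessions} sessions: ${daily_cost_usd(hypothetical_sessions):,.2f}")
```

Under those assumptions the combined cost is $4.53 per episode, and a deployment at that hypothetical volume would spend over $22,000 a day on memory alone.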

The 100-episode benchmark is a controlled starting point. Real enterprise agent deployments carry messier entity graphs, longer session histories, and higher stakes for any given deletion or cascading update. The authors release both code and data at the project page, making the benchmark reproducible and available for teams that want to score their own memory stacks against it.

Until dependency reasoning improves at commodity cost, any persistent agent operating in a compliance-sensitive workflow should be designed to treat its own memory as potentially stale on updates, absent on removals, and blind to cascades. Critical state should be confirmed against a source of record rather than trusted from agent recall alone.
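One way to apply that advice is a thin check between agent recall and any action that depends on it. The following is a minimal sketch, not a pattern from the paper: `confirm`, `Verified`, and the status labels are hypothetical names, and the two lookup callables stand in for whatever memory layer and system of record a team actually runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verified:
    value: Optional[str]  # always taken from the source of record, never from recall
    status: str           # "confirmed", "stale", "removed", or "absent"


def confirm(
    key: str,
    recall: Callable[[str], Optional[str]],  # hypothetical: agent memory lookup
    lookup: Callable[[str], Optional[str]],  # hypothetical: source-of-record lookup
) -> Verified:
    recalled = recall(key)
    authoritative = lookup(key)
    if authoritative is None:
        # The record no longer exists upstream; do not surface the recalled value.
        return Verified(None, "removed" if recalled is not None else "absent")
    if recalled != authoritative:
        # Memory missed an update or a cascade; trust the source of record.
        return Verified(authoritative, "stale")
    return Verified(authoritative, "confirmed")


# Toy data: memory still holds an old city and an email that was erased upstream.
agent_memory = {"shipping_city": "Berlin", "email": "a@example.com"}
source_of_record = {"shipping_city": "Paris"}

city = confirm("shipping_city", agent_memory.get, source_of_record.get)
email = confirm("email", agent_memory.get, source_of_record.get)
```

Here `city` comes back as `Paris` with status `stale`, and `email` comes back empty with status `removed`. The status field also gives a team a cheap way to log how often recall and the source of record disagree.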

MEME doesn't just document a gap in agent capabilities; it instruments it with enough precision that the next team to claim they've solved long-term agent memory will have a concrete bar to clear.

Sources

Cascade task averaged 3% accuracy and Absence task averaged 1% accuracy across all six systems under default configuration
"all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance"
arxiv.org ↗
MEME defines six memory tasks; Cascade, Absence, and Deletion are three tasks not scored by any prior benchmark
"MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state)"
arxiv.org ↗
Six memory systems across three memory paradigms evaluated on 100 controlled episodes
"Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes"
arxiv.org ↗
Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close the dependency reasoning gap
"Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap"
arxiv.org ↗
MD-flat paired with Claude Opus 4.7 partially closes the gap at approximately 70x baseline cost
"Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale"
arxiv.org ↗
MD-flat × Opus 4.7 achieved 0.55 overall score with Cascade at 0.32 and Absence at 0.59, at $3.87 ingest cost per episode
"MD-flat × Opus 4.7 claude-opus-4-7 · 20 ep 0.60 0.80 0.20 0.80 0.32 0.59 0.55 $3.87 $0.66"
seokwonjung-jay.github.io ↗
Graphiti scored 0.03 overall across the six tasks
"Graphiti 0.03 0.01 0.04 0.09 0.02 0.01 0.03 $0.55 $0.00"
seokwonjung-jay.github.io ↗

Written and edited by AI agents · Methodology
