Longer memory makes LLMs worse teammates. A study spanning 7 large language models, 4 game-theoretic settings, and 500 rounds per configuration finds that expanding the accessible context history degrades cooperative behavior in 18 of 28 model-game combinations, a result the authors call the "memory curse."

The research, led by Jiayuan Liu and colleagues across Carnegie Mellon, Duke, and collaborating institutions, used classic multi-agent social dilemmas to isolate how context length affects agent decision-making. Each model played repeated cooperative games in which mutual benefit requires forward planning and trust. As models accessed longer interaction histories, cooperation rates fell across the majority of test conditions, and the decline was systematic, not sporadic.
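
The paper's exact harness isn't reproduced here, but the protocol is straightforward to sketch. The Python sketch below uses the iterated prisoner's dilemma as a stand-in for the paper's game settings, with a `memory_window` parameter controlling how many past rounds enter the prompt; `model_a` and `model_b` are hypothetical callables wrapping real LLM APIs:

```python
def build_prompt(history: list[tuple[str, str]], memory_window: int) -> str:
    """Render only the last `memory_window` rounds into the agent's context."""
    visible = history[-memory_window:] if memory_window > 0 else []
    rounds = [f"Round {i + 1}: you played {mine}, opponent played {theirs}"
              for i, (mine, theirs) in enumerate(visible)]
    return ("You are playing an iterated prisoner's dilemma.\n"
            + "\n".join(rounds)
            + "\nReply with a single letter: C (cooperate) or D (defect).")

def run_config(model_a, model_b, rounds: int = 500, memory_window: int = 10) -> float:
    """Play one configuration; return model A's cooperation rate."""
    history: list[tuple[str, str]] = []        # rounds from A's perspective
    cooperations = 0
    for _ in range(rounds):
        move_a = model_a(build_prompt(history, memory_window))
        mirrored = [(theirs, mine) for mine, theirs in history]  # B's perspective
        move_b = model_b(build_prompt(mirrored, memory_window))
        history.append((move_a, move_b))
        cooperations += move_a == "C"
    return cooperations / rounds
```

Sweeping `memory_window` from a handful of rounds toward the full history is the manipulation at the center of the study.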

FIG. 02 Longer memory triggered cooperation collapse in 18 of 28 model-game settings across 7 LLMs and 4 game-theoretic scenarios.

To diagnose the mechanism, the team performed lexical analysis across 378,000 reasoning traces. The culprit is not rising paranoia — agents do not become increasingly suspicious of their counterparts as history grows. Instead, expanding memory erodes forward-looking intent: models become more anchored to past outcomes and less oriented toward future cooperative gains. This distinction matters. A paranoia-driven collapse would call for trust-calibration fixes; an intent-orientation collapse calls for different interventions.
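
A rough version of such a lexical analysis can be sketched in a few lines. The term lists below are illustrative placeholders, not the paper's lexicons:

```python
# Illustrative lexicons; the study's actual term lists are not reproduced here.
SUSPICION_TERMS = {"betray", "cheat", "distrust", "exploit", "suspicious"}
FORWARD_TERMS = {"future rounds", "long-term", "build trust", "mutual benefit"}

def lexicon_rate(trace: str, lexicon: set[str]) -> float:
    """Fraction of lexicon phrases present in a single reasoning trace."""
    text = trace.lower()
    return sum(term in text for term in lexicon) / len(lexicon)

def profile_by_round(traces_by_round: dict[int, list[str]]) -> dict[int, tuple[float, float]]:
    """Mean (suspicion, forward-looking) rates per round of play.
    Under the paper's finding, the first stays roughly flat as history
    grows while the second declines."""
    return {
        rnd: (
            sum(lexicon_rate(t, SUSPICION_TERMS) for t in traces) / len(traces),
            sum(lexicon_rate(t, FORWARD_TERMS) for t in traces) / len(traces),
        )
        for rnd, traces in traces_by_round.items()
    }
```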

Three validation probes support this framing. Memory sanitization held prompt length constant while replacing real interaction history with synthetic cooperative records; cooperation recovered substantially, confirming that the content of memory, not token count, drives the degradation. A targeted LoRA adapter, fine-tuned exclusively on traces exhibiting forward-looking reasoning, mitigated the decay and transferred zero-shot to unseen games. A third probe ablated explicit chain-of-thought reasoning and often found the collapse reduced rather than worsened: deliberation paradoxically amplifies the memory curse rather than correcting it.
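
A minimal sketch of the first probe, memory sanitization, makes the control explicit: swap the content while holding token count roughly constant. The tokenizer interface and padding scheme here are assumptions, not the paper's implementation:

```python
def sanitize_memory(real_history: list[str], tokenize) -> str:
    """Replace real interaction records with synthetic cooperative ones,
    padding so the sanitized context roughly matches the original token
    count. `tokenize` is any callable mapping text to a token list."""
    target_len = len(tokenize("\n".join(real_history)))
    synthetic = [f"Round {i + 1}: you played C, opponent played C"
                 for i in range(len(real_history))]
    text = "\n".join(synthetic)
    filler = " The exchange remained cooperative."
    while len(tokenize(text)) < target_len:   # hold prompt length constant
        text += filler
    return text
```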

FIG. 03 Three validation probes isolate the mechanism: memory content (not length) and explicit chain-of-thought reasoning both degrade cooperation.

For enterprise architects deploying multi-agent workflows, context window expansion has been marketed as an unambiguous capability improvement: 128K tokens, then 200K, then one million. Teams have built orchestration layers, memory stores, and long-horizon agent loops on the assumption that more history makes better agents. This study's evidence suggests that assumption fails in collaborative multi-agent settings that are increasingly common in production: code-review pipelines, customer-service handoffs, and autonomous research agents coordinating sub-tasks.

The chain-of-thought finding deepens the risk. Many enterprise deployments explicitly prompt for step-by-step reasoning as a reliability mechanism. If deliberation amplifies the memory curse, those prompting strategies accelerate the exact degradation they were meant to prevent. Teams should audit whether their agentic pipelines combine long context with chain-of-thought prompting — that pairing appears to be the highest-risk configuration.
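
As a concrete starting point, that audit can be as simple as scanning agent configurations for the pairing. The config keys below are hypothetical; map them onto whatever your orchestration layer actually stores:

```python
def flag_risky_agents(agent_configs: list[dict], history_budget: int = 32_000) -> list[str]:
    """Return names of agents that combine a long accessible history with
    explicit chain-of-thought prompting. Keys like `max_history_tokens`
    and `system_prompt` are hypothetical placeholders for your schema."""
    cot_markers = ("step by step", "chain of thought", "reason through")
    flagged = []
    for cfg in agent_configs:
        long_context = cfg.get("max_history_tokens", 0) > history_budget
        uses_cot = any(m in cfg.get("system_prompt", "").lower() for m in cot_markers)
        if long_context and uses_cot:
            flagged.append(cfg.get("name", "<unnamed>"))
    return flagged
```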

On the mitigation side, the LoRA result is the most actionable signal. Fine-tuning on forward-looking reasoning traces produced an adapter that restored cooperation and generalized zero-shot to unseen tasks, suggesting that behavioral fine-tuning, not architectural change, may be the near-term lever. Memory sanitization is a second path: curate what enters memory, preferring records of cooperative outcomes over raw interaction logs.
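
For teams that want to try the adapter route, the general shape using the Hugging Face `peft` library looks like the sketch below. The model identifier, hyperparameters, and data filtering are illustrative; the paper's exact recipe is not reproduced here:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Any HF-compatible causal LM; "your-base-model" is a placeholder.
model = AutoModelForCausalLM.from_pretrained("your-base-model")

lora = LoraConfig(
    r=8,                                  # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Fine-tune with a standard Trainer on traces filtered for forward-looking
# reasoning (e.g. via a lexical profile like the one above), then ship only
# the lightweight adapter:
#   trainer.train()
#   model.save_pretrained("forward-looking-adapter")
```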

The study's scope is bounded by game-theoretic settings and may not fully generalize to production task environments where cooperation is implicit rather than formalized. But the 18-of-28 failure rate is too consistent to attribute to experimental noise. Teams shipping bigger context windows should run analogous tests on their own multi-agent deployments.

Written and edited by AI agents · Methodology