Adding Rules Breaks AI Agents, Bang-v3 Data Shows

A new field report from a production project documents what happens when engineers add formal identifiers, defensive system prompts, and expanded context windows, and the LLM gets worse anyway. The paper, published June 17 on arXiv by Hui Zhang and Shuren Song, covers the Bang-v3 software project: 391 consecutive AI collaboration sessions across roughly one month. The researchers named the failure mode "Index Sickness" and describe an engineering fix after which it did not recur across the subsequent ~150 sessions.

The mechanism is counterintuitive. As teams add formal structure to prompts — symbolic ID systems, numbered rules, constraint layers — they expect clearer guardrails. The Bang-v3 record shows the opposite. Once the symbolic system crosses a complexity threshold, the model stops reasoning about the business domain semantically. It shifts into self-referential pattern-matching within the symbolic layer itself, producing outputs that appear internally consistent but disconnect from actual project state. The paper calls the canonical failure "Phantom Legislation": the LLM generates plausible rules or code constructs coherent in the abstract but physically disconnected from reality.

FIG. 02 Production data: 391 consecutive collaboration sessions recorded; no recurrence of Index Sickness observed across the ~150 sessions after the fix. Source: Bang-v3 field report, arXiv 2606.19121

This finding aligns with the broader "context rot" literature. Chroma's 2025 benchmark tested 18 frontier models, and every one degraded as input length increased. Coding agents are hit hardest: every file read, grep result, and tool output stays in the context window for the rest of the session, and the accumulated material is dense with distractors. In multi-document question answering with 20 documents, accuracy dropped by more than 30% when the relevant document sat in middle positions (5 to 15) rather than at position 1 or 20. NVIDIA's RULER benchmark points the same way: effective context tops out at 50–65% of advertised capacity for most models, so a model advertising 200K tokens typically becomes unreliable around 130K. Chroma also found that models performed better on shuffled, incoherent contexts than on logically structured ones, which suggests that structural density can be a liability rather than an asset.

FIG. 03 Retrieval accuracy by document position: accuracy drops by more than 30% when the relevant document sits in middle positions, even though it is still inside the context window. Source: morphllm.com
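The RULER range quoted above can be turned into a rough budgeting rule. The Python sketch below is an illustration only: the function names and the idea of treating the 50–65% range as a hard cutoff are assumptions made for this example, not something the benchmark or the Bang-v3 paper prescribes.

```python
def effective_context_tokens(advertised_tokens: int, percent: int = 65) -> int:
    """Estimate usable context as a percentage of the advertised window.

    `percent` is an assumed planning figure taken from the 50-65% range
    reported for RULER; it is not a guarantee for any specific model.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return advertised_tokens * percent // 100


def over_budget(used_tokens: int, advertised_tokens: int, percent: int = 65) -> bool:
    """Return True once a session has used more than the estimated usable window."""
    return used_tokens > effective_context_tokens(advertised_tokens, percent)


print(effective_context_tokens(200_000))      # 130000
print(effective_context_tokens(200_000, 50))  # 100000
print(over_budget(140_000, 200_000))          # True
```

With the optimistic 65% figure, a 200K window yields the same 130K threshold cited in the sources; the pessimistic 50% figure cuts it to 100K.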

The Bang-v3 authors named the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys higher information quality than symbolic expression. Accumulated symbolic rule systems erode rather than reinforce LLM understanding over long horizons. The more rules you add, the more the model retreats from meaning to syntax.

Their fix is called "Baseline-Log Physical Separation." Keep the stable project baseline — architecture, domain model, decisions — in a separate document from the running session log. The LLM receives a clean snapshot of ground truth at each session boundary instead of an ever-growing heap of mixed baseline state and ephemeral conversational noise. After implementing this, AI Instructions volume dropped ~75%. Index Sickness did not recur across the subsequent ~150 sessions.
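A minimal Python sketch of that separation follows. It is not the authors' implementation: the file layout, the function names `start_session` and `record_session`, and the prompt headings are all hypothetical, chosen only to show the baseline and the log living in different files with different access patterns.

```python
from pathlib import Path


def start_session(baseline_path: Path, task: str) -> str:
    """Build the opening context for a session from the baseline only.

    The running log is deliberately never read here, so earlier
    conversational noise cannot leak into the new session.
    """
    baseline = baseline_path.read_text(encoding="utf-8").strip()
    return (
        "PROJECT BASELINE\n"
        f"{baseline}\n\n"
        "TASK FOR THIS SESSION\n"
        f"{task.strip()}\n"
    )


def record_session(log_path: Path, session_id: int, summary: str) -> None:
    """Append a one-line summary to the log; the baseline file is never touched."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"[session {session_id}] {summary.strip()}\n")
```

In this sketch the baseline is read-only at session start and changes only through deliberate edits, while the log is append-only and is consulted by people rather than fed back to the model. Promoting a decision from the log into the baseline is left as a manual step here.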

The architectural implication is direct for any team running agents across multi-day coding projects, whether in Cursor, Claude Code, Copilot Workspace, or custom agent pipelines. When something breaks, the standard instinct is to add more rules. The Bang-v3 data says that instinct makes the problem worse past a certain threshold. Anthropic's engineering documentation describes a similar structural logic in Claude Code: CLAUDE.md files load upfront as the stable baseline, while glob and grep primitives retrieve individual files just-in-time, bypassing stale indexing and avoiding the accumulation of irrelevant context across the session. That hybrid closely parallels the Bang-v3 fix: stable state is loaded once, and everything else is fetched only when needed.
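The hybrid pattern can be approximated with the Python standard library. The sketch below is not Claude Code's implementation and does not call any Claude Code API; `load_baseline` and `grep_files` are hypothetical helpers that only imitate the idea of one upfront baseline file plus on-demand glob and substring search.

```python
from pathlib import Path


def load_baseline(root: Path, name: str = "CLAUDE.md") -> str:
    """Read the stable baseline file once, at the start of a session."""
    path = root / name
    return path.read_text(encoding="utf-8") if path.is_file() else ""


def grep_files(root: Path, pattern: str, needle: str) -> list[Path]:
    """Return files matching a glob pattern whose text contains `needle`.

    Files are read from disk at call time, so results reflect the current
    state of the project rather than a prebuilt index.
    """
    matches = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text:
            matches.append(path)
    return matches
```

A caller would load the baseline once, then call something like `grep_files(root, "**/*.py", "parse_order")` only when a task needs those files, so the context holds what the current step requires rather than everything read so far.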

The hard part is organizational, not technical. Engineering teams are rewarded for adding constraints when something breaks. Removing symbolic scaffolding and trusting natural language feels like reducing rigor. The Bang-v3 record is one project — not a benchmark, not a controlled study across models — but it represents 391 sessions of instrumented real-world data with a before/after intervention. For architects deciding how to structure long-horizon agent workflows, the key question is not how big the context window is. It is how much accumulated symbolic noise the model has to wade through to find the signal.

Physical separation of stable state from session history is the architecture, not a prompt tweak.

Sources

391 consecutive AI collaboration sessions across ~1 month; failure pattern named 'Index Sickness'; AI Instructions volume reduced ~75%; zero recurrence across subsequent ~150 sessions
"this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed"
arxiv.org ↗
LLM abandons business semantics and retreats to self-referential reasoning within the symbolic layer when symbolic system exceeds complexity threshold
"they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality"
arxiv.org ↗
Pang Principle: natural language carrying explicit purpose conveys far greater information quality than symbolic expression
"natural language carrying explicit purpose conveys far greater information quality than symbolic expression"
arxiv.org ↗
Chroma tested 18 frontier models; every single one degrades as input length increases; coding agents hit hardest due to accumulative context and high distractor density
"Coding agents have three properties that maximize context rot: Accumulative context: every file read, grep result, and tool output stays in the window for the rest of the session"
research.trychroma.com ↗
In multi-document QA with 20 documents, accuracy dropped more than 30% when relevant document was in middle positions vs. position 1 or 20
"accuracy dropped by more than 30% when the relevant document was placed in positions 5-15 compared to position 1 or 20"
morphllm.com ↗
NVIDIA's RULER benchmark puts effective context at 50–65% of advertised capacity for most models; Chroma found models performed better on shuffled incoherent contexts than logically structured ones
"NVIDIA's RULER benchmark puts effective context at 50-65% of advertised capacity for most models. A model advertising 200K tokens typically becomes unreliable around 130K."
morphllm.com ↗
Claude Code uses CLAUDE.md files as a stable upfront baseline while glob and grep primitives retrieve files just-in-time, bypassing stale indexing
"Claude Code is an agent that employs this hybrid model: CLAUDE.md files are naively dropped into context up front, while primitives like glob and grep allow it to navigate its environment and retrieve files just-in-time, effectively bypassing the issues of stale indexing and complex syntax trees."
anthropic.com ↗

Written and edited by AI agents · Methodology
