A new field report from a real production project documents what happens when engineers implement formal identifiers, defensive system prompts, and expanded context windows — and the LLM gets worse anyway. The paper, published June 17 on arXiv by Hui Zhang and Shuren Song, covers the Bang-v3 software project: 391 consecutive AI collaboration sessions across roughly one month. The researchers named the failure mode "Index Sickness" and identified an engineering fix that eliminated it for the subsequent ~150 sessions without recurrence.
The mechanism is counterintuitive. Teams that add formal structure to prompts (symbolic ID systems, numbered rules, constraint layers) expect clearer guardrails. The Bang-v3 record shows the opposite. Once the symbolic system crosses a complexity threshold, the model stops reasoning semantically about the business domain. It shifts into self-referential pattern-matching within the symbolic layer itself, producing outputs that appear internally consistent but no longer match the actual project state. The paper calls the canonical failure "Phantom Legislation": the LLM generates plausible rules or code constructs that are coherent in the abstract but correspond to nothing that exists in the project.
This finding aligns with the broader "context rot" literature. Chroma's 2025 benchmark tested 18 frontier models, and every one degraded as input length increased. Coding agents are hit hardest: every file read, grep result, and tool output stays in the context window for the rest of the session, and that logically structured material acts as a dense field of distractors. In multi-document question-answering, accuracy dropped more than 30% when the relevant document sat in middle positions rather than at the start or end. NVIDIA's RULER benchmark confirms the pattern: effective context tops out at 50–65% of advertised capacity for most models. Chroma also found that models performed better on shuffled, incoherent contexts than on logically structured ones. The attention mechanism behaves differently under coherent structure, which makes structural density a liability rather than an asset.
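As a rough planning aid, the 50–65% figure above can be turned into a token budget. The sketch below is only arithmetic on that cited range; the function names are hypothetical, and the range is a heuristic quoted in this article, not a guarantee for any particular model.

```python
def effective_context_range(advertised_tokens: int) -> tuple[int, int]:
    """Return (low, high) token estimates using the 50-65% range cited above.

    Integer arithmetic is used so the results are exact token counts.
    """
    if advertised_tokens < 0:
        raise ValueError("advertised_tokens must be non-negative")
    low = advertised_tokens * 50 // 100
    high = advertised_tokens * 65 // 100
    return low, high


def fits_comfortably(prompt_tokens: int, advertised_tokens: int) -> bool:
    """True if the prompt stays under the conservative (low) estimate."""
    low, _ = effective_context_range(advertised_tokens)
    return prompt_tokens <= low
```

Under this heuristic, a model advertised at 200,000 tokens would be budgeted at roughly 100,000 to 130,000 usable tokens.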
The Bang-v3 authors named the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys higher information quality than symbolic expression. Accumulated symbolic rule systems erode rather than reinforce LLM understanding over long horizons. The more rules you add, the more the model retreats from meaning to syntax.
Their fix is called "Baseline-Log Physical Separation." The stable project baseline (architecture, domain model, decisions) lives in a separate document from the running session log. At each session boundary the LLM receives a clean snapshot of ground truth instead of an ever-growing heap that mixes baseline state with ephemeral conversational noise. After the team implemented this, the volume of AI Instructions dropped by about 75%, and Index Sickness did not recur across the subsequent ~150 sessions.
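One way to picture the separation is as two files with different lifecycles. The following is a minimal sketch under assumptions, not the Bang-v3 authors' implementation: the class name, file layout, and the `recent_log_lines` parameter are all hypothetical. The point it illustrates is that session context is assembled from the baseline file, and the log is append-only history that is excluded by default.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ProjectMemory:
    """Hypothetical holder for two physically separate files."""

    baseline_path: Path  # stable ground truth: architecture, domain model, decisions
    log_path: Path       # append-only running session log

    def append_log(self, entry: str) -> None:
        """Record session history without ever touching the baseline."""
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(entry.rstrip("\n") + "\n")

    def session_context(self, recent_log_lines: int = 0) -> str:
        """Build the context for a new session.

        By default only the baseline is included. A caller may opt in to
        the last few log lines, but never the whole accumulated log.
        """
        baseline = self.baseline_path.read_text(encoding="utf-8").strip()
        parts = ["PROJECT BASELINE:\n" + baseline]
        if recent_log_lines > 0 and self.log_path.exists():
            lines = self.log_path.read_text(encoding="utf-8").splitlines()
            tail = lines[-recent_log_lines:]
            if tail:
                parts.append("RECENT SESSION ENTRIES:\n" + "\n".join(tail))
        return "\n\n".join(parts)
```

Because the log can grow without bound while `session_context()` stays the size of the baseline, the start of session 300 looks the same to the model as the start of session 3.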
The architectural implication is direct for any team running agents across multi-day coding projects — Cursor, Claude Code, Copilot Workspace, or custom agent pipelines. When something breaks, the standard instinct is to add more rules. The Bang-v3 data says that instinct makes the problem worse past a certain threshold. Anthropic's engineering documentation describes the same structural logic in Claude Code: CLAUDE.md files load upfront as the stable baseline, while glob and grep primitives retrieve individual files just-in-time — bypassing stale indexing and avoiding accumulation of irrelevant context across the session. That hybrid is architecturally identical to the Bang-v3 fix, arrived at independently.
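The just-in-time half of that hybrid can be sketched with standard-library stand-ins for glob and grep. This is an illustration of the retrieval pattern only, not Claude Code's actual tooling; the function names and the `max_hits` cap are assumptions made for the example.

```python
import re
from pathlib import Path


def glob_files(root: Path, pattern: str) -> list[Path]:
    """List files under root matching a glob pattern such as '**/*.py'."""
    return sorted(p for p in root.glob(pattern) if p.is_file())


def grep(paths: list[Path], regex: str, max_hits: int = 20) -> list[tuple[str, int, str]]:
    """Return (path, line number, stripped line) for matching lines.

    Files are read at call time, so results reflect the current state of
    the project rather than a stale index. max_hits caps how much text
    is pulled into the context per lookup.
    """
    compiled = re.compile(regex)
    hits: list[tuple[str, int, str]] = []
    for path in paths:
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if compiled.search(line):
                hits.append((str(path), lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

An agent built this way would load the baseline once, then call something like `grep(glob_files(root, "**/*.py"), r"def charge")` only when a task needs that code, so unrelated files never enter the session.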
The hard part is organizational, not technical. Engineering teams are rewarded for adding constraints when something breaks. Removing symbolic scaffolding and trusting natural language feels like reducing rigor. The Bang-v3 record is one project — not a benchmark, not a controlled study across models — but it represents 391 sessions of instrumented real-world data with a before/after intervention. For architects deciding how to structure long-horizon agent workflows, the key question is not how big the context window is. It is how much accumulated symbolic noise the model has to wade through to find the signal.
Physically separating stable state from session history is an architectural decision, not a prompt tweak.