Researchers at the Institute of Science Tokyo and Imperial College London have published a method that replaces linear-time attention over long prefixes with a memory lookup whose cost does not grow with prefix length. On LLaMA 3.1-8B, the approach cuts attention latency by 1.36x at an 8K memory budget and beats full-attention RAG on the NBA benchmark at 20% of its memory footprint, with no gradient updates to the model.
At inference, attention over a prefix (system prompt, retrieved documents, in-context examples) scales linearly with prefix length during both prefill and every decode step. Prefix caching (Anthropic's approach in Claude Code) amortizes prefill cost but leaves decode-step overhead and KV-cache memory untouched. Attention compression still reads the prefix. Context distillation and hypernetwork fine-tuning require gradient-based training, which is expensive and breaks when the prefix changes. This method sidesteps both problems: it does not read the prefix at decode time, and it requires no gradient-based training.
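To make the linear scaling concrete, the sketch below estimates KV-cache size as a function of prefix length. The layer count, KV-head count, and head dimension are assumptions taken from the publicly released LLaMA 3.1-8B configuration, and 2 bytes per value assumes fp16 or bf16 storage. None of these figures come from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size for a prefix of num_tokens tokens.

    Defaults are assumed values for LLaMA 3.1-8B stored in fp16/bf16.
    """
    # Factor of 2: one key vector and one value vector per token, head, layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


for prefix_len in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(prefix_len) / 2**30
    print(f"{prefix_len:>7} tokens -> {gib:.0f} GiB of KV cache")
```

Under these assumptions the cache grows by 128 KiB per prefix token, and full attention reads all of it at every decode step, which is the cost a lookup-based approach aims to avoid.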
The method precomputes prefix attention using forward passes only. It runs representative queries through the model, collects their attention outputs, and clusters them into centroids. At inference, an incoming query retrieves the nearest centroid and merges it with self-attention using an online-softmax operation, a lossless reconstruction that skips the prefix tokens. Lookup cost scales logarithmically with centroid count, and once the memory is built, per-decode cost stays constant as the prefix grows.
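The merge step can be illustrated with a small pure-Python sketch. It assumes each memory entry stores a centroid together with a prefix attention output and its log-sum-exp normalizer. That layout, and the names `partial_attention`, `merge_partials`, and `nearest_entry`, are illustrative assumptions rather than the paper's implementation, and the clustering step is omitted.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of q over one block; returns (output, log-sum-exp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(len(values[0]))]
    return out, top + math.log(denom)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results as if computed over both blocks."""
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    total = w_a + w_b
    out = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return out, top + math.log(total)


def nearest_entry(q, memory):
    """Brute-force nearest-centroid search (linear scan, for illustration only)."""
    return min(memory, key=lambda e: math.dist(q, e["centroid"]))


# Toy data: a 3-token prefix and a 2-token suffix with 2-dimensional heads.
prefix_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefix_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
suffix_k = [[0.5, -0.5], [2.0, 1.0]]
suffix_v = [[-1.0, 0.0], [0.0, 7.0]]
query = [0.3, 0.9]

# Build time (assumed layout): store the prefix-only result under a centroid.
# Here the centroid equals the query, so the lookup is exact.
p_out, p_lse = partial_attention(query, prefix_k, prefix_v)
memory = [
    {"centroid": query, "out": p_out, "lse": p_lse},
    {"centroid": [-5.0, -5.0], "out": [0.0, 0.0], "lse": 0.0},  # distractor
]

# Decode time: look up the prefix contribution, attend only to the suffix.
entry = nearest_entry(query, memory)
s_out, s_lse = partial_attention(query, suffix_k, suffix_v)
merged, merged_lse = merge_partials(entry["out"], entry["lse"], s_out, s_lse)

# Reference: full attention over prefix + suffix.
full, full_lse = partial_attention(query, prefix_k + suffix_k, prefix_v + suffix_v)
print(merged, full)
```

In this sketch the merge is exact with respect to the stored entry. When a runtime query differs from its nearest centroid, the retrieved prefix contribution is an approximation. The linear scan would also need to be replaced by an indexed search for lookup cost to grow logarithmically with centroid count.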
On ManyICLBench with LLaMA 3.1-8B, the method improves accuracy over standard in-context learning across budgets from 1K to 8K centroids and cuts latency by 1.36x at the 8K budget. On NBA (a RAG task), it surpasses full-attention baselines at 20% of the KV memory. The paper also validates on RuleArena for rule-following over long system prompts. Hardware specifics and throughput are not disclosed.
Two constraints apply. First, building the memory requires a representative query corpus: the quality of the centroids depends on how well the construction queries cover the runtime query distribution, and for general-purpose RAG with unpredictable queries, selecting that corpus is non-trivial. Second, the memory is tied to a specific prefix, so any prefix update requires a rebuild. That rebuild is forward-only, but it is still a real operational step. The paper does not report construction cost in wall-clock time or GPU-hours, figures architects would need to budget a rebuild cadence.
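One way to reason about coverage, offered as a hypothetical diagnostic and not something the paper describes, is to measure how far held-out runtime queries fall from their nearest centroid. The sketch below uses toy 2-D vectors and Euclidean distance; the function name `coverage_gap`, the distance metric, and the data are all assumptions for illustration.

```python
import math


def coverage_gap(centroids, queries):
    """Mean Euclidean distance from each query to its nearest centroid."""
    nearest = [min(math.dist(q, c) for c in centroids) for q in queries]
    return sum(nearest) / len(nearest)


# Toy 2-D stand-ins for query vectors; real ones would come from the model.
centroids = [[0.0, 0.0], [1.0, 1.0]]
in_distribution = [[0.1, 0.0], [0.9, 1.0]]
shifted = [[3.0, 3.0], [4.0, 2.0]]

print(coverage_gap(centroids, in_distribution))
print(coverage_gap(centroids, shifted))
```

A gap that rises on production traffic relative to the construction set would suggest the query corpus no longer matches runtime queries, though what threshold matters in practice is an open question.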
The paper notes that prefix influence decays as generation proceeds, even under full attention. But the memory is built from a static query distribution. In long generative sequences where model state drifts, the retrieved centroid may diverge from what full attention would produce. The evaluation tasks run moderate-length generation. Whether the 1.36x latency gain holds beyond 32K decode steps remains untested.
Code is available at github.com/yasu0001/AttentionMemory.
Architect's takeaway: for stable, high-reuse prefixes (fixed system prompts, daily-updated document corpora), precomputing centroids once and serving lookups at decode time is a drop-in latency gain with zero retraining. Risk: a poorly constructed query corpus at build time degrades retrieval quality at runtime.
Written and edited by AI agents