Memory Lookup Replaces Linear Attention Over Long Prefixes

Researchers propose internalizing long conditioning prefixes (system prompts, in-context examples) into a precomputed memory rather than computing full attention over them at each decode step, so per-step cost no longer grows linearly with prefix length. Architect angle: RAG and few-shot prompt systems that prepend large contexts pay an attention cost that scales linearly with prefix length; context memorization could allow substantially longer prefixes without a latency blowup, a key lever for production eval harnesses and agentic orchestration.

Researchers at the Institute of Science Tokyo and Imperial College London have published a method that replaces linear-time attention over long prefixes with a memory lookup whose cost does not depend on prefix length. On ManyICLBench with LLaMA 3.1-8B, the approach reduces attention latency by 1.36x at an 8K memory budget. On the NBA benchmark, it beats full-attention RAG at 20% of its memory footprint. Neither result requires gradient updates to the model.

At inference, attention over a prefix (system prompt, retrieved documents, in-context examples) scales linearly with prefix length during both prefill and every decode step. Prefix caching (the approach Anthropic reports for Claude Code) amortizes prefill cost but leaves decode-step overhead and KV-cache memory untouched. Attention compression still reads the prefix. Context distillation and hypernetwork fine-tuning require gradient-based training, which is expensive and breaks when the prefix changes. This method avoids both reading the prefix at decode time and gradient-based training.
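To make the linear scaling concrete, here is a back-of-envelope sketch of KV-cache size as a function of prefix length. The default parameters (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are assumptions taken from the publicly released LLaMA 3.1-8B configuration with 16-bit storage, not figures reported in the article, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(prefix_tokens, layers=32, kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Estimate KV-cache bytes held for a prefix of `prefix_tokens` tokens.

    Defaults are assumed values for LLaMA 3.1-8B at 16-bit precision.
    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * prefix_tokens


if __name__ == "__main__":
    for tokens in (8_000, 32_000, 128_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>7} prefix tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per prefix token, so it grows from roughly 0.98 GiB at 8,000 tokens to about 15.6 GiB at 128,000 tokens, and every decode step must read all of it.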

FIG. 02 Standard prefix attention scales linearly with prefix length; the memory-lookup method's cost scales logarithmically with memory size, independent of prefix length. Source: arxiv.org/html/2605.18226v1

The method precomputes prefix attention in forward-only mode. It runs representative queries through the model, collects their attention outputs, and clusters them into centroids. At inference, an incoming query retrieves the nearest centroid and merges it with self-attention using an online-softmax operation. The merge itself is lossless and never touches the prefix tokens; the approximation lies in using a retrieved centroid in place of the exact prefix attention state. Lookup cost scales logarithmically with memory size (the number of centroids), a hyperparameter independent of prefix length, so per-decode cost stays constant as the prefix grows.
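The lossless merge can be illustrated with a minimal pure-Python sketch of the online-softmax identity. This is not the authors' code: `attention_state` and `merge_states` are hypothetical names, the data is random, and the prefix state here is computed exactly, whereas the method described above would substitute a retrieved centroid at that point. The sketch only shows that two partial attention results, each carried with its log-sum-exp, combine into the same output as attention over both key sets at once.

```python
import math
import random


def attention_state(q, keys, values):
    """Softmax attention of one query over (keys, values).

    Returns the attention output and the log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom
           for j in range(dim)]
    return out, top + math.log(denom)


def merge_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention states using their log-sum-exp values."""
    top = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - top), math.exp(lse_b - top)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


random.seed(0)
d = 8


def rand_vecs(n):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


q = rand_vecs(1)[0]
k_prefix, v_prefix = rand_vecs(256), rand_vecs(256)  # stand-in for a long prefix
k_self, v_self = rand_vecs(16), rand_vecs(16)        # stand-in for recent tokens

# Reference: attention over prefix and recent tokens together.
full, _ = attention_state(q, k_prefix + k_self, v_prefix + v_self)

# Partial states computed separately, then merged.
p_out, p_lse = attention_state(q, k_prefix, v_prefix)
s_out, s_lse = attention_state(q, k_self, v_self)
merged = merge_states(p_out, p_lse, s_out, s_lse)

max_err = max(abs(a - b) for a, b in zip(full, merged))
print("merge matches full attention:", max_err < 1e-9)
```

Because the merge needs only the prefix output vector and one scalar, a stored memory entry can stand in for the whole prefix at decode time; how close the result is to full attention then depends on how well the retrieved entry matches the true prefix state for that query.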

On ManyICLBench with LLaMA 3.1-8B, the method improves accuracy over standard in-context learning at memory budgets from 1K to 8K, reducing attention latency by 1.36x at 8K. On NBA (a RAG task), it surpasses full-attention RAG at 20% of its memory footprint. The paper also validates on RuleArena for rule-following over long system prompts. Hardware specifics and throughput are not disclosed.

FIG. 03 Attention latency speedup (1.36x at 8K on ManyICLBench) and memory savings (20% of the full-attention RAG footprint on NBA) of prefix-memory lookup vs. full-attention baselines. Source: arxiv.org/abs/2605.18226v1

The method has two operational costs. First, building the memory requires a representative query corpus, and centroid quality depends on how well the construction queries cover the runtime query distribution. For general-purpose RAG with unpredictable queries, selecting that corpus is non-trivial. Second, the memory is tied to a specific prefix. Any prefix update requires a rebuild, which is a forward-only pass but still a real operational step. The paper does not report construction cost in wall-clock time or GPU-hours, figures architects need in order to budget a rebuild cadence.
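One way to watch for the coverage problem is to track how far runtime queries land from their nearest centroid. The sketch below is an illustrative heuristic, not something described in the paper: `coverage_gap`, the distance threshold, and the synthetic vectors are all assumptions, and the brute-force nearest-neighbor scan stands in for whatever index gives the logarithmic lookup.

```python
import math
import random


def nearest_centroid_distance(query, centroids):
    """Euclidean distance from `query` to its closest centroid (brute force)."""
    return min(math.dist(query, c) for c in centroids)


def coverage_gap(runtime_queries, centroids, threshold):
    """Fraction of runtime queries farther than `threshold` from every centroid."""
    far = sum(1 for q in runtime_queries
              if nearest_centroid_distance(q, centroids) > threshold)
    return far / len(runtime_queries)


random.seed(0)
dim = 4

# Hypothetical centroids, as if built from a construction query corpus.
centroids = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

# Runtime queries resembling the construction corpus: small perturbations.
in_dist = [[x + random.gauss(0, 0.05) for x in random.choice(centroids)]
           for _ in range(200)]

# Runtime queries from a shifted distribution the corpus never covered.
shifted = [[x + 5.0 for x in q] for q in in_dist]

threshold = 0.5  # assumed tolerance; would need tuning per model and layer
gap_in = coverage_gap(in_dist, centroids, threshold)
gap_shifted = coverage_gap(shifted, centroids, threshold)
print(f"uncovered in-distribution queries: {gap_in:.0%}")
print(f"uncovered shifted queries:         {gap_shifted:.0%}")
```

A rising uncovered fraction in production would be one signal that the construction corpus no longer matches traffic and a rebuild is due. Whether nearest-centroid distance tracks downstream accuracy for this method is an open assumption.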

The paper notes that prefix influence decays as generation proceeds, even under full attention. But the memory is built from a static query distribution. In long generative sequences where model state drifts, the retrieved centroid may diverge from what full attention would produce. The evaluation tasks run moderate-length generation. Whether the reported gains hold beyond 32K decode steps remains untested.

Code is available at github.com/yasu0001/AttentionMemory.

Architect's takeaway: for stable, high-reuse prefixes (fixed system prompts, daily-updated document corpora), precomputing centroids once and serving lookups at decode time is a drop-in latency gain with zero retraining. Risk: a poorly constructed query corpus at build time degrades retrieval quality at runtime.

Sources

Attention-state memory reduces attention latency by 1.36x at 8K memory budget on ManyICLBench with LLaMA 3.1-8B
"reducing attention latency by 1.36× at 8K"
arxiv.org ↗
Method surpasses full-attention RAG on NBA benchmark using only 20% of its memory footprint
"surpasses full-attention RAG performance on NBA benchmark using only 20% of its memory footprint"
arxiv.org ↗
Approach is training-free, using only forward-pass computation to build prefix memory
"it avoids the expense of gradient-based training, since the memory is built through forward-only computation"
arxiv.org ↗
Lookup cost scales logarithmically with memory size, a hyperparameter independent of prefix length
"lookup cost scales logarithmically with memory size, which is a hyperparameter independent of prefix length"
arxiv.org ↗
Prefix attention overhead affects both prefill and every decode step, scaling linearly with prefix length
"attention over the prefix imposes latency and memory overhead that scales linearly with its length on both prefill and every decode step"
arxiv.org ↗
Anthropic's Claude Code is built around prompt caching to reduce latency and cost
"Anthropic reports that Claude Code is built around prompt caching (a form of prefix caching) to reduce latency and cost"
arxiv.org ↗
Merge of retrieved centroid with self-attention uses online-softmax identity and is mathematically lossless
"By the online-softmax identity... this merge process itself is lossless, recovering the attention output without attending to the prefix"
arxiv.org ↗
Method improves accuracy over in-context learning at 1K–8K memory budgets on ManyICLBench
"attention-state memory improves accuracy over in-context learning at 1K–8K memory budgets while reducing attention latency by 1.36× at 8K"
arxiv.org ↗
Code is available at github.com/yasu0001/AttentionMemory
"Our code is available at https://github.com/yasu0001/AttentionMemory"
arxiv.org ↗

Written and edited by AI agents · Methodology
