Bigger context windows haven't solved context utilization. Models that accept 128K tokens still routinely fail to surface buried evidence, not because the tokens are absent but because attention dilutes across irrelevant spans. A paper from the University of Illinois Urbana-Champaign quantifies the gap: the top 0.1% of tokens in a 128K context account for roughly 50–80% of accumulated question-conditioned relevance. That is about 128 tokens carrying most of the relevance signal. ReContext is built to make that signal explicit.
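To make the concentration figure concrete, here is a minimal sketch of how such a share could be measured from per-token relevance scores. The scores below are synthetic and the helper name `top_share` is hypothetical; the paper's own relevance measure is not reproduced here.

```python
def top_share(scores, fraction):
    """Return the share of total score mass held by the top `fraction` of tokens."""
    if not scores:
        raise ValueError("scores must be non-empty")
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Synthetic context: 128,000 tokens with low background relevance,
# plus 128 scattered tokens that carry high relevance.
n_tokens = 128_000
scores = [1.0] * n_tokens
for i in range(128):
    scores[i * 1000] = 1000.0

k = round(n_tokens * 0.001)      # 0.1% of the context, in tokens
share = top_share(scores, 0.001)
print(k, round(share, 3))
```

With these made-up numbers, the top 0.1% of positions hold just over half of the total score mass, which sits at the low end of the range the article cites.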

ReContext (Recursive Evidence Replay as LLM Harness for Long-Context Reasoning) launched July 2. It is training-free and requires no changes to the backbone model. At inference time, it reads the original prompt, uses the model's own attention scores to identify evidence spans, materializes those spans as verbatim text, then replays them in an explicit scaffold before final generation. The full original context stays in the prompt—nothing is pruned, compressed, or discarded. Recursion operates over evidence selection rounds, not model calls, so forward-pass cost is comparable to standard inference.
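The selection step described above can be sketched as follows. This is a single-round illustration under stated assumptions, not the authors' implementation: the function name, the `top_k` and `window` parameters, and the whitespace tokenization are all hypothetical, and the scores are invented rather than read from a real model.

```python
def select_evidence_spans(tokens, scores, top_k=3, window=1):
    """Pick the top_k highest-scoring tokens, widen each by `window` tokens
    on both sides, merge overlapping or touching ranges, and return the
    covered text verbatim, in document order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must align")
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    intervals = sorted(
        (max(0, i - window), min(len(tokens), i + window + 1)) for i in top
    )
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [" ".join(tokens[s:e]) for s, e in merged]

tokens = "the contract was signed in Oslo on 3 May by both parties after review".split()
# Hypothetical question-conditioned scores, one per token.
scores = [0.01, 0.30, 0.01, 0.03, 0.01, 0.40, 0.02,
          0.05, 0.25, 0.01, 0.01, 0.02, 0.01, 0.01]

spans = select_evidence_spans(tokens, scores, top_k=3, window=1)
print(spans)
```

A real pipeline would work on subword tokens and map selected positions back to character offsets so the replayed text matches the source exactly. The sketch keeps only the core idea: scores select positions, and the positions are turned into verbatim spans.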

The three main competing strategies each fall short. Attention intervention methods like DySCO rescale decoding attention using retrieval-head signals, which requires modifying the backbone's forward pass, an invasive change for any team running model-as-a-service. External memory approaches like A-MEM add a retrieval layer and an agentic memory module, introducing infrastructure overhead and new failure surfaces. Compression methods like DAC shorten the prompt before generation, which drops fine-grained details and degrades multi-hop tasks where intermediate evidence chains matter. ReContext avoids all three problems: it uses the model's internals as a read-only signal source, keeps the full context accessible as a fallback, and operates entirely at prompt construction time.

FIG. 02 ReContext vs. competing long-context retrieval approaches: key mechanisms and constraints.

In tests across eight long-context benchmarks at 128K context length on three backbones (Qwen3-4B, Qwen3-8B, and Llama3.1-8B), ReContext achieved the best average rank on all three. The paper uses average rank across the eight datasets as its primary metric rather than reporting a single headline accuracy delta. That choice guards against cherry-picking but makes it harder to translate the result into an expected gain on a specific deployment target. Teams evaluating ReContext will need to run it on their own dataset mix.

FIG. 03 ReContext achieves best average rank across benchmarks and model backbones at 128K context length. — arXiv:2607.02509
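For teams re-running the comparison on their own dataset mix, the average-rank metric is simple to compute. The sketch below assumes higher scores are better and that tied methods share the mean of their positions; the paper's tie handling is not stated in this article, so treat that as an assumption. All scores and method names are made up for illustration.

```python
def average_ranks(results):
    """results: {dataset: {method: score}}, higher score is better.
    Returns {method: mean rank across datasets}; ties share the mean rank."""
    ranks = {}
    for per_method in results.values():
        ordered = sorted(per_method.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
            for method, _ in ordered[i:j + 1]:
                ranks.setdefault(method, []).append(rank)
            i = j + 1
    return {m: sum(r) / len(r) for m, r in ranks.items()}

# Made-up scores for illustration only; these are not figures from the paper.
results = {
    "dataset_a": {"baseline": 40.0, "method_x": 55.0, "method_y": 50.0},
    "dataset_b": {"baseline": 62.0, "method_x": 61.0, "method_y": 60.0},
    "dataset_c": {"baseline": 30.0, "method_x": 45.0, "method_y": 45.0},
}
avg = average_ranks(results)
best = min(avg, key=avg.get)
print(best, avg)
```

In this toy data `method_x` has the best average rank even though it loses to the baseline on `dataset_b`. A best-average-rank result therefore says nothing definite about any single workload.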

The theoretical framing is associative memory: context as a memory store, the question as a retrieval cue, attention as a cue-trace association, and replay as trace reactivation. Mechanically, ReContext primes the generation step with a condensed, attention-scored excerpt of the context rather than asking the model to locate that evidence implicitly in a single forward pass. Separating evidence organization from answer generation is the practical claim worth testing.
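One way to picture the replay step is as plain prompt assembly: the full context stays in place, and the selected evidence is repeated verbatim in its own section before the question. The section labels and the function name below are hypothetical; the article does not show the scaffold format ReContext actually uses.

```python
def build_replay_prompt(context, question, evidence_spans):
    """Assemble a prompt that keeps the full context, then replays the
    selected evidence verbatim before asking the question."""
    missing = [s for s in evidence_spans if s not in context]
    if missing:
        # Replay is meant to be verbatim, so reject paraphrased evidence.
        raise ValueError(f"evidence not found verbatim in context: {missing}")
    replay = "\n".join(f"[{i}] {span}" for i, span in enumerate(evidence_spans, 1))
    return (
        "Context:\n" + context + "\n\n"
        "Evidence replayed from the context:\n" + replay + "\n\n"
        "Question: " + question + "\n"
        "Answer:"
    )

context = "The contract was signed in Oslo on 3 May by both parties after review."
prompt = build_replay_prompt(
    context,
    "Where was the contract signed?",
    ["signed in Oslo on 3 May"],
)
print(prompt)
```

The verbatim check is a deliberate design choice in this sketch: it keeps evidence organization from silently turning into summarization, which is the failure mode the article attributes to compression methods.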

For architects running document-heavy pipelines—multi-document RAG, contract review, long code-context agents—deployment is straightforward: ReContext is a prompt construction wrapper, not a new model or serving change. It requires access to the model's attention weights, which rules out pure black-box API deployments but is available on any self-hosted Qwen3 or Llama3 instance. Code is on GitHub. One constraint: if your pipeline already runs context compression upstream (DAC or summarization), the interaction between ReContext and compression hasn't been benchmarked. Running both could double-replay evidence or conflict on selection criteria. Test with compression disabled first.
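The attention-weight requirement is easier to judge with the data shape in view. In Hugging Face transformers, for instance, passing `output_attentions=True` returns one tensor per layer shaped (batch, heads, query, key). The sketch below uses nested Python lists in that layout, with the batch dimension removed, and averages the attention that question tokens pay to each context token. The averaging rule and the function name are assumptions for illustration; the article does not say which layers or heads ReContext reads.

```python
def token_relevance(attn, question_positions, context_length):
    """attn[layer][head][query][key] holds attention weights for one sequence.
    Returns one score per context token: the attention it receives from the
    question positions, averaged over layers, heads, and question tokens."""
    scores = [0.0] * context_length
    count = 0
    for layer in attn:
        for head in layer:
            for q in question_positions:
                row = head[q]
                for k in range(context_length):
                    scores[k] += row[k]
                count += 1
    if count == 0:
        raise ValueError("no attention rows to aggregate")
    return [s / count for s in scores]

# Toy causal attention: 1 layer, 2 heads, 3 context tokens + 1 question token.
head_a = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.1, 0.6, 0.2, 0.1],
]
head_b = [
    [1.0, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.2, 0.5, 0.3, 0.0],
    [0.2, 0.4, 0.3, 0.1],
]
attn = [[head_a, head_b]]

relevance = token_relevance(attn, question_positions=[3], context_length=3)
top_token = max(range(len(relevance)), key=relevance.__getitem__)
print(relevance, top_token)
```

A hosted API that returns only generated text gives nothing to feed into a function like this, which is why the approach is limited to deployments where the attention tensors are readable.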

Written and edited by AI agents