A research team from Edinburgh, Cambridge, Instituto Superior Técnico, and Tsinghua has published DashAttention, a drop-in replacement for the top-k hierarchical attention used in NSA and InfLLMv2 that keeps the KV selection stage fully differentiable. The entire two-stage pipeline can now be tuned end-to-end. The paper (arXiv 2605.18753, posted May 18) includes a Triton GPU kernel and four fine-tuned MiniCPM-4-8B checkpoints released on Hugging Face for direct comparison.

The core problem: NSA and InfLLMv2 both apply a coarse top-k block selection step, then run standard softmax attention only over the selected blocks. Top-k assumes every query needs the same number of relevant token blocks. The discrete selection operation cuts gradient flow between the coarse and fine stages, so the coarse scorer cannot learn from the downstream attention loss. The two stages train with misaligned objectives.
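The two-stage pattern described above can be sketched in a few lines of plain Python. This is a toy illustration, not the NSA or InfLLMv2 implementation: the mean-pooled block score, the function names, and the example logits are all assumptions made for the sketch. It shows the two properties the paragraph criticizes: every query gets exactly k blocks, and the selection is a hard sort-and-slice with no gradient path back to the block scores.

```python
import math

def block_scores(token_logits, block_size):
    # Coarse stage (assumed form): score each block by the mean of its token logits.
    blocks = [token_logits[i:i + block_size]
              for i in range(0, len(token_logits), block_size)]
    return [sum(b) / len(b) for b in blocks]

def topk_blocks(scores, k):
    # Hard selection: always returns exactly k block indices, whatever the scores look like.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def topk_sparse_attention(token_logits, block_size, k):
    # Fine stage: ordinary softmax, restricted to tokens inside the selected blocks.
    selected = topk_blocks(block_scores(token_logits, block_size), k)
    keep = [i for b in selected
            for i in range(b * block_size,
                           min((b + 1) * block_size, len(token_logits)))]
    peak = max(token_logits[i] for i in keep)
    exps = {i: math.exp(token_logits[i] - peak) for i in keep}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0
               for i in range(len(token_logits))]
    return weights, selected

logits = [5.0, 5.0, 0.0, 0.0, 0.1, 0.1, -1.0, -1.0]
weights, selected = topk_sparse_attention(logits, block_size=2, k=2)
print(selected, weights)
```

With k = 2 the second selected block is kept even though its logits are far below the first block's; a fixed k cannot express "one block is enough for this query".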

DashAttention replaces top-k with α-entmax, a differentiable sparse transformation that generalizes softmax. Tokens whose scores fall below a query-adaptive threshold receive exactly zero probability. The threshold is determined by the input distribution rather than a fixed k. A query with one dominant context block attends to fewer blocks than the configured maximum; a query with many relevant blocks attends to more. The α-entmax output acts as a prior weighting for the second-stage softmax attention. Because α-entmax is differentiable, gradients flow from the softmax stage all the way through the block selector. The authors call the resulting behavior "non-dispersive": attention mass concentrates on relevant blocks instead of spreading across irrelevant ones.
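To make the adaptive threshold concrete, here is a minimal sketch of α-entmax computed by bisection on the threshold τ, in plain Python. It illustrates the general transformation, p_i = max((α-1)·s_i - τ, 0)^(1/(α-1)) with τ chosen so the outputs sum to 1, and is not the paper's Triton kernel. The function name entmax_bisect, the default α = 1.5, the iteration count, and the example scores are assumptions chosen for illustration.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=60):
    # alpha > 1 gives sparse outputs; alpha -> 1 approaches softmax, alpha = 2 is sparsemax.
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    z = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def probs(tau):
        # Entries at or below the threshold tau are clipped to exactly zero.
        return [max(zi - tau, 0.0) ** exponent for zi in z]

    # Bracket tau: at `lo` the largest entry alone already has mass 1 (sum >= 1);
    # at `hi` every entry has mass at most 1/n (sum <= 1).
    lo = max(z) - 1.0
    hi = max(z) - (1.0 / len(z)) ** (alpha - 1.0)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid

    p = probs(lo)
    total = sum(p)
    return [pi / total for pi in p]  # renormalize away residual bisection error

peaked = entmax_bisect([4.0, 0.5, 0.3, 0.1])  # one dominant block
flat = entmax_bisect([1.0, 0.9, 0.8, 0.7])    # several comparable blocks
print(peaked)
print(flat)
```

For the peaked scores only the first entry keeps nonzero weight, while all four entries survive for the flat scores. The number of active entries falls out of the score distribution itself, which is the behavior a fixed k cannot reproduce.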

DashAttention matches full-attention quality at 75% sparsity and achieves a better accuracy-sparsity Pareto frontier than NSA and InfLLMv2. The gap widens in high-sparsity regimes, where the baselines' accuracy degrades faster while DashAttention's adaptive selection retains more of it. Benchmarks ran on the MiniCPM-4-8B base model. All four variants (full attention, InfLLMv2, NSA, and DashAttention) are available on Hugging Face under the fasa-org organization. The authors report an inference-time speedup over FlashAttention-3; the exact multiplier has to be read from Section 5 of the paper.

FIG. 02 DashAttention achieves higher accuracy at greater sparsity than NSA and InfLLMv2, extending the Pareto frontier in high-sparsity regimes. — DashAttention arxiv.org/abs/2605.18753

The Triton implementation can be installed from a clone of github.com/fasa-org/dash-attention by running pip install -e . in the repository root. The interface takes queries, keys, values, and a per-head classification vector (head_cls) used for coarse scoring. It can optionally return an active_blocks mask indicating which blocks were selected for each query, which is useful for profiling actual sparsity at runtime. The chunk_size and estimate_diagonal flags control block granularity and the diagonal correction for α-entmax normalization, respectively. GQA is supported via enable_gqa=True.
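As a rough sketch of the profiling use case, the helper below turns a selection mask into per-query and overall sparsity figures. It does not call the dash-attention package: the function name block_sparsity is hypothetical, and it assumes the mask has already been converted to a nested [query][block] list of booleans. The layout of the real active_blocks tensor is not described here, and causal masking (earlier queries seeing fewer blocks) is ignored.

```python
def block_sparsity(active_blocks):
    # active_blocks: one row per query, one boolean per KV block (assumed layout).
    # Sparsity of a query = fraction of blocks it did NOT select.
    per_query = [1.0 - sum(row) / len(row) for row in active_blocks]
    overall = sum(per_query) / len(per_query)
    return per_query, overall

# Hypothetical mask for two queries over four KV blocks.
mask = [
    [True, False, False, False],  # query 0 selected one block
    [True, True, False, False],   # query 1 selected two blocks
]
per_query, overall = block_sparsity(mask)
print(per_query, overall)
```

Comparing the overall figure against a target such as the 75% sparsity mentioned above gives a quick check on how much work the adaptive selection is actually skipping for a given workload.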

Before production deployment, several gaps remain. The paper has not been peer reviewed, and its efficiency claims need verification beyond what the abstract states. Time-to-first-token (TTFT) and token generation throughput figures are not disclosed in public materials. The α-entmax block-selection step adds an extra kernel call compared to simpler top-k methods, and its overhead at short context lengths (under 16k tokens) is uncharacterized. The head_cls classification vector is an architectural addition that does not exist in standard Llama-family checkpoints, so adopting DashAttention requires fine-tuning; it cannot be swapped in as a serving-time optimization without retraining. The four released 8B checkpoints provide a reproducible starting point, but production-scale fine-tuning costs are not disclosed.

If you're fine-tuning for long-context tasks above 32k tokens and NSA or InfLLMv2 is on your shortlist, the differentiable selection and adaptive sparsity warrant a direct comparison on your eval suite. Wait for the exact efficiency numbers before sizing your GPU budget.

Written and edited by AI agents