DashAttention reaches 75% sparsity while matching full-attention accuracy

A research team from Edinburgh, Cambridge, Instituto Superior Técnico, and Tsinghua has published DashAttention, a drop-in replacement for the top-k hierarchical attention used in NSA and InfLLMv2 that keeps the KV selection stage fully differentiable. The entire two-stage pipeline can now be tuned end-to-end. The paper (arXiv 2605.18753, posted May 18) includes a Triton GPU kernel and four fine-tuned MiniCPM-4-8B checkpoints released on Hugging Face for direct comparison.

The core problem: NSA and InfLLMv2 both apply a coarse top-k block selection step, then run standard softmax attention only over the selected blocks. Top-k assumes every query needs the same number of relevant token blocks. The discrete selection operation cuts gradient flow between the coarse and fine stages, so the coarse scorer cannot learn from the downstream attention loss. The two stages train with misaligned objectives.

DashAttention replaces top-k with α-entmax, a differentiable sparse transformation that generalizes softmax. Tokens whose scores fall below a query-adaptive threshold receive exactly zero probability. The threshold is determined by the input distribution rather than a fixed k. A query with one dominant context block attends to fewer blocks than the configured maximum; a query with many relevant chunks attends to more. The α-entmax output acts as a prior weighting for the second-stage softmax attention. Because α-entmax is differentiable, gradients flow from the softmax stage all the way through the block selector. The authors call this property "non-dispersive" — attention mass concentrates on relevant blocks instead of spreading across irrelevant ones.

DashAttention matches full attention quality at 75% sparsity and achieves a better Pareto frontier than NSA and InfLLMv2. The gap widens in high-sparsity regimes. Competitors' accuracy degrades faster above comfortable sparsity ranges, while DashAttention's adaptive selection retains more accuracy at scale. Benchmarks ran on MiniCPM-4-8B base. All four variants — full attention, InfLLMv2, NSA, and DashAttention — are available on Hugging Face under the fasa-org organization. The authors report a speedup over FlashAttention-3 at inference time; the exact multiplier requires Section 5 of the paper.

FIG. 02 DashAttention achieves higher accuracy at greater sparsity than NSA and InfLLMv2, extending the Pareto frontier in high-sparsity regimes. — DashAttention arxiv.org/abs/2605.18753

The Triton implementation is installable via pip install -e . from github.com/fasa-org/dash-attention. The interface wraps queries, keys, values, and a per-head classification vector (head_cls) for coarse scoring. It optionally returns an active_blocks mask indicating which blocks were selected per query, useful for profiling actual sparsity at runtime. The chunk_size and estimate_diagonal flags control block granularity and diagonal correction for α-entmax normalization. GQA is supported via enable_gqa=True.

Before production deployment, two gaps remain. The paper is pre-peer-review and efficiency claims need verification beyond the abstract. TTFT and token generation throughput data are not disclosed in public materials. The α-entmax block-selection step adds an extra kernel call compared to simpler top-k methods; overhead at short context lengths (under 16k) is uncharacterized. The head_cls classification vector is an architectural addition that doesn't exist in standard Llama-family checkpoints. Adopting DashAttention requires fine-tuning — it cannot be swapped in as a serving-time optimization without retraining. The four released 8B checkpoints provide a reproducible starting point, but production-scale fine-tuning costs are not disclosed.

If you're fine-tuning for long-context tasks above 32k and NSA or InfLLMv2 is on your shortlist, the differentiable selection and adaptive sparsity warrant a direct comparison on your eval suite. Wait for the exact efficiency numbers before sizing GPU budget.

Sources

DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes
"DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes"
arxiv.org ↗
DashAttention uses the α-entmax transformation to select a variable number of blocks according to the current query, keeping the entire hierarchy fully differentiable
"DashAttention leverages the adaptively sparse α-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable."
arxiv.org ↗
Top-k operation assumes the number of relevant tokens for any query is fixed and precludes gradient flow between the sparse and dense stages
"the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages"
arxiv.org ↗
DashAttention is non-dispersive, translating to better long-context modeling ability
"Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability."
arxiv.org ↗
DashAttention provides an efficient, GPU-aware Triton implementation which achieves a speedup over FlashAttention-3 at inference time
"We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time."
arxiv.org ↗
Four MiniCPM-4-8B model checkpoints (FullAttn, InfLLMv2, NSA, DashAttention) released on Hugging Face for reproducibility
"We release our 8B models for reproducibility. 8B-FullAttn, 8B-InfLLMv2, 8B-NSA, 8B-DashAttention"
github.com ↗
DashAttention Triton kernel is installable via pip and supports GQA, chunk_size configuration, and returns an active_blocks mask
"model = dash_attn(chunk_size=chunk_size, enable_gqa=True, estimate_diagonal=True, return_active_blocks=True,)"
github.com ↗

Written and edited by AI agents · Methodology

DashAttention reaches 75% sparsity while matching full-attention accuracy

Get the signal before the noise.

Get the signal before the noise.