Gated DeltaNet-2 Beats Linear Baselines on Long-Context Retrieval

NVIDIA's Gated DeltaNet-2 beats every linear-attention baseline it was compared against at 1.3B scale, with the widest margins on RULER needle-in-a-haystack retrieval, where prior delta-rule models score far lower.

The paper, by Ali Hatamizadeh, Yejin Choi, and Jan Kautz at NVIDIA, targets a limitation shared by prior delta-rule linear attention models. Gated DeltaNet and Kimi Delta Attention (KDA) both use a single scalar gate β_t to govern two distinct memory operations: erasing stale content along the key axis and committing new content along the value axis. Tying both to one number forces a single write decision onto two separate concerns, so an update that should selectively revise one association can interfere with others already stored. Gated DeltaNet-2 replaces the shared scalar with two independent channel-wise gates: an erase gate b_t ∈ [0,1]^{d_k} on the key side and a write gate w_t ∈ [0,1]^{d_v} on the value side. Channel-wise decay from KDA is preserved. The update rule reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay also collapses to a scalar, so it is a strict generalization of both.
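To make the decoupling concrete, here is a minimal pure-Python sketch of one recurrent step. The exact update equation is not given in this article, so the form below is an assumption chosen only because it satisfies the stated reduction property: with `b` and `w` both set to the same scalar β, it matches a KDA-style step. The function names `decoupled_step` and `kda_style_step`, and the d_k × d_v state layout, are hypothetical and are not taken from the released code.

```python
# Sketch under stated assumptions; NOT the paper's exact equation or kernel.
# State S is a d_k x d_v matrix stored as a list of rows.

def decoupled_step(S, k, v, alpha, b, w):
    """One recurrent update with separate erase and write gates.

    alpha: channel-wise decay, one value per key channel, in [0, 1]
    b:     erase gate, one value per key channel, in [0, 1]
    w:     write gate, one value per value channel, in [0, 1]
    Assumed form: S' = (I - (b*k) k^T) Diag(alpha) S + k (w*v)^T
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay: scale each key row of the state.
    D = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the decayed state currently stores under key k.
    read = [sum(k[i] * D[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Erase along the key axis (gated by b), write along the value axis (gated by w).
    return [[D[i][j] - b[i] * k[i] * read[j] + k[i] * w[j] * v[j]
             for j in range(d_v)]
            for i in range(d_k)]


def kda_style_step(S, k, v, alpha, beta):
    """Reference step with one scalar gate beta shared by erase and write."""
    d_k, d_v = len(k), len(v)
    D = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    read = [sum(k[i] * D[i][j] for i in range(d_k)) for j in range(d_v)]
    return [[D[i][j] - beta * k[i] * read[j] + beta * k[i] * v[j]
             for j in range(d_v)]
            for i in range(d_k)]
```

Calling `decoupled_step` with `b = [beta] * d_k` and `w = [beta] * d_v` gives the same result as `kda_style_step` up to floating-point rounding. Setting some entries of `b` to zero protects the corresponding key channels from erasure while `w` still controls which value channels receive new content, which is the independence a shared scalar cannot express.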

The team implemented a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, plus a gate-aware backward pass, all fused in Triton. Training throughput, measured on a single H100, stays nearly flat as sequence length grows, with only a small constant overhead over KDA attributable to the two additional per-channel gate computations.

All models were trained at 1.3B parameters on 100B tokens of FineWeb-Edu. The recipe uses AdamW with a peak learning rate of 4e-4, weight decay of 0.1, gradient clipping at 1.0, and a cosine schedule with a 1B-token warmup, at a global batch size of 0.5M tokens and a sequence length of 4K. Hybrid variants interleave linear attention layers with 2K sliding-window attention. State size is matched across all baselines.
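The token budgets above translate into optimizer steps as follows. This is a rough sketch, not the authors' training script: it assumes "0.5M" means exactly 500,000 tokens per step (it could equally be 2^19), a linear warmup, and cosine decay to zero, none of which the recipe states. The names `lr_at`, `TOTAL_STEPS`, and `WARMUP_STEPS` are illustrative.

```python
import math

TOTAL_TOKENS = 100_000_000_000    # 100B tokens, from the recipe
WARMUP_TOKENS = 1_000_000_000     # 1B-token warmup, from the recipe
TOKENS_PER_STEP = 500_000         # assumption: "0.5M" read as exactly 500,000
PEAK_LR = 4e-4                    # from the recipe
MIN_LR = 0.0                      # assumption: final learning rate is not stated

TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP     # 200,000 under the assumption above
WARMUP_STEPS = WARMUP_TOKENS // TOKENS_PER_STEP   # 2,000 under the assumption above


def lr_at(step):
    """Learning rate at a zero-based optimizer step (linear warmup, then cosine)."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the warmup covers 1% of training, so nearly the whole run sits on the cosine decay.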

On language modeling and commonsense reasoning, the recurrent Gated DeltaNet-2 reaches 53.11 average accuracy, against 52.28 for KDA and 52.39 for Mamba-3 MIMO. The hybrid variant reaches 53.97, against 52.72 for Mamba-3 MIMO and 52.68 for KDA. Wikipedia perplexity for the recurrent model is 15.90, down from 16.40 for Gated DeltaNet and 16.81 for KDA.

Retrieval shows the larger gaps. On RULER MK-NIAH-1 (multi-key needle-in-a-haystack) at 4K context, the recurrent Gated DeltaNet-2 scores 37.8 against KDA's 28.0 and Gated DeltaNet's 27.8, a 35% relative improvement over KDA. S-NIAH-3 at 2K rises from 63.2 (KDA) to 89.8. In the hybrid setting, MK-NIAH-1 reaches 48.0 against KDA's 40.4 and Mamba-3 MIMO's 46.6. Real-world retrieval across SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP averages 29.88 recurrent and 42.28 hybrid, leading all baselines in both settings.

Ablations confirm that both gates contribute, with the erase gate b_t accounting for most of the retrieval gain: selective key-side protection stops old associations from being overwritten during unrelated writes.
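The relative figures can be rechecked from the reported scores. The short sketch below only redoes that arithmetic on numbers quoted in this article; the helper name `relative_gain` is illustrative.

```python
def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old


# (Gated DeltaNet-2 score, KDA score), as quoted in the article.
reported = {
    "MK-NIAH-1 @4K, recurrent": (37.8, 28.0),
    "S-NIAH-3 @2K, recurrent": (89.8, 63.2),
    "MK-NIAH-1 @4K, hybrid": (48.0, 40.4),
}

for name, (gdn2, kda) in reported.items():
    points = gdn2 - kda
    print(f"{name}: +{points:.1f} points, {relative_gain(gdn2, kda):.1f}% relative")
```

The recurrent MK-NIAH-1 gap works out to 9.8 points, or 35.0% relative to KDA. The S-NIAH-3 gap is larger in both absolute and relative terms, at 26.6 points and about 42.1%.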

FIG. 02 Gated DeltaNet-2 vs. KDA and Mamba-3 on commonsense reasoning (recurrent and hybrid modes). — NVIDIA paper, arxiv.org/abs/2605.22791

This is a research release with no production deployment data: no cost-per-token figures, no p99 latency numbers at batch, and no serving benchmarks beyond single-H100 training throughput curves. The code is released under the NVIDIA Source Code License-NC, a non-commercial license, so teams building commercial inference products cannot adopt it without negotiating a separate license. Training used a 4K sequence length, so any evaluation beyond 4K is length extrapolation rather than trained behavior; practitioners pushing to 32K or 128K will need to retrain or validate carefully. The hybrid variant is also not purely recurrent: its 2K sliding-window attention layers keep a key-value cache that is bounded by the window but still adds softmax-attention cost and memory on top of the fixed-size linear state.
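For a sense of what the sliding-window layers cost at decode time, the sketch below estimates their key-value cache size. Everything except the 2K window is a placeholder: the layer count, head count, head dimension, and 2-byte element size are hypothetical arguments, not figures from the paper, and the window is assumed to be exactly 2048 tokens.

```python
def swa_cached_tokens(context_len, window=2048):
    """Tokens a sliding-window layer must keep cached at a given context length."""
    return min(context_len, window)


def swa_kv_cache_bytes(context_len, n_swa_layers, n_kv_heads, head_dim,
                       window=2048, bytes_per_elem=2):
    """Approximate KV-cache size for the sliding-window layers only.

    All architecture arguments are hypothetical inputs supplied by the caller.
    The leading factor of 2 accounts for storing both keys and values.
    """
    tokens = swa_cached_tokens(context_len, window)
    return 2 * n_swa_layers * tokens * n_kv_heads * head_dim * bytes_per_elem
```

Past the window length the cache stops growing, so `swa_kv_cache_bytes` returns the same value at 4K and at 128K context for any fixed architecture. Decode memory therefore stays constant in sequence length, but it is larger than that of the purely recurrent variant.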

For long-context inference stacks where constant-memory decoding is a hard constraint, the pattern is clear: decouple your erase and write operations channel-wise. Sharing one scalar gate across both operations costs retrieval accuracy, and the Gated DeltaNet-2 results quantify how much at 1.3B scale.

Sources

Gated DeltaNet-2 introduces channel-wise erase gate b_t and write gate w_t, decoupling the single scalar gate used in prior delta-rule models
"We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing."
arxiv.org ↗
All models trained at 1.3B parameters on 100B FineWeb-Edu tokens
"At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval."
arxiv.org ↗
Gated DeltaNet-2 recovers KDA when both gates collapse to the same scalar, and Gated DeltaNet when decay also collapses
"reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses"
arxiv.org ↗
Chunkwise WY algorithm with channel-wise decay, gate-aware backward pass fused in Triton; near-flat throughput scaling on a single H100
"Hardware-efficient Training — fast-weight WY chunkwise algorithm with gate-aware backward, fused in Triton"
github.com ↗
Training recipe: AdamW peak LR 4e-4, weight decay 0.1, gradient clip 1.0, cosine schedule with 1B-token warmup, global batch size 0.5M tokens, sequence length 4K
"AdamW, peak LR 4e-4, weight decay 0.1, gradient clip 1.0 Cosine schedule with 1B-token warmup Global batch size 0.5M tokens, sequence length 4K"
github.com ↗
Gated DeltaNet-2 recurrent average accuracy 53.11 vs. KDA 52.28 and Mamba-3 MIMO 52.39
"Gated DeltaNet-2 15.90 11.41 48.09 53.11"
github.com ↗
Hybrid Gated DeltaNet-2 average accuracy 53.97 vs. Mamba-3 MIMO 52.72 and KDA 52.68
"Gated DeltaNet-2 15.62 10.43 50.90 53.97"
github.com ↗
RULER MK-NIAH-1 @4K recurrent: Gated DeltaNet-2 scores 37.8 vs KDA 28.0 and Gated DeltaNet 27.8
"Gated DeltaNet-2 93.0 89.8 37.8"
github.com ↗
S-NIAH-3 @2K recurrent: Gated DeltaNet-2 89.8 vs KDA 63.2
"KDA 89.0 63.2 28.0 ... Gated DeltaNet-2 93.0 89.8 37.8"
github.com ↗
Hybrid MK-NIAH-1 @4K: Gated DeltaNet-2 48.0 vs KDA 40.4 and Mamba-3 MIMO 46.6
"Gated DeltaNet-2 57.9 99.0 48.0"
github.com ↗
Real-world retrieval recurrent avg: Gated DeltaNet-2 29.88 vs KDA 28.67; hybrid avg 42.28 vs Mamba-3 MIMO 40.11
"Recurrent avg. 26.84 28.09 28.67 28.35 29.88 Hybrid avg. 39.74 39.11 40.14 40.11 42.28"
github.com ↗
Erase gate b_t accounts for most of the retrieval gain in ablations
"Ablations confirm both gates contribute, with the erase gate b_t accounting for most of the gain — consistent with its role in selectively protecting or revising key-side associations in the recurrent state."
github.com ↗
Code released under NVIDIA Source Code License-NC (non-commercial)
"Licensed under the NVIDIA Source Code License-NC. See LICENSE for details."
github.com ↗
Hybrid models use a 2K sliding-window attention size alongside linear attention layers
"Hybrid models use a 2K sliding-window attention size"
github.com ↗

Written and edited by AI agents · Methodology
