NVIDIA's Gated DeltaNet-2 beats every linear-attention baseline at 1.3B scale, with the widest margin on multi-key retrieval in long contexts — the benchmark where prior delta-rule models collapse.

The paper from Ali Hatamizadeh, Yejin Choi, and Jan Kautz at NVIDIA targets a structural flaw in all prior delta-rule linear attention models: Gated DeltaNet and Kimi Delta Attention (KDA) both use a single scalar gate β_t to govern two distinct memory operations—erasing stale content on the key axis, and committing new content on the value axis. This coupling forces a single write decision on two separate concerns, causing interference that scrambles existing associations when it should be selectively revising them. Gated DeltaNet-2 replaces the shared scalar with two independent channel-wise gates: an erase gate b_t ∈ [0,1]^{d_k} on the key side, and a write gate w_t ∈ [0,1]^{d_v} on the value side. Channel-wise decay from KDA is preserved. The update rule recovers KDA exactly when both gates collapse to the same scalar, and Gated DeltaNet when decay also collapses—so this is a strict generalization.
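
The decoupling described above can be sketched as a single recurrent step. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the state layout (d_k × d_v), the exact placement of the gates, and the names `gdn2_step` and `kda_step` are illustrative. The form is chosen only because it reduces to a KDA-style update when both gates equal the same scalar, which is the property the article states.

```python
import numpy as np

def gdn2_step(S, k, v, alpha, b, w):
    """One recurrent step of a decoupled-gate delta rule (illustrative form).

    S:     (d_k, d_v) memory state
    k, v:  key (d_k,) and value (d_v,)
    alpha: (d_k,) channel-wise decay in [0, 1]
    b:     (d_k,) erase gate on the key axis
    w:     (d_v,) write gate on the value axis
    """
    S = alpha[:, None] * S              # channel-wise decay along the key axis
    read = k @ S                        # (d_v,) what memory currently returns for k
    S = S - np.outer(b * k, read)       # erase: (I - (b*k) k^T) S, gated per key channel
    S = S + np.outer(k, w * v)          # write: k (w*v)^T, gated per value channel
    return S

def kda_step(S, k, v, alpha, beta):
    """Reference step with one scalar gate beta shared by erase and write (assumed form)."""
    S = alpha[:, None] * S
    return S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)

# Sanity check of the reduction: equal scalar gates recover the shared-gate step.
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S0 = rng.normal(size=(d_k, d_v))
k0 = rng.normal(size=d_k)
v0 = rng.normal(size=d_v)
alpha0 = rng.uniform(size=d_k)
beta0 = 0.7
assert np.allclose(
    gdn2_step(S0, k0, v0, alpha0, np.full(d_k, beta0), np.full(d_v, beta0)),
    kda_step(S0, k0, v0, alpha0, beta0),
)
```

Setting `alpha` to a constant vector in the same check gives the scalar-decay case, which is how the article describes recovering Gated DeltaNet. With `b` and `w` free to differ, a step can write a new value without erasing along key channels it wants to protect.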

The team implemented a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, plus a gate-aware backward pass, all fused in Triton. Training throughput, measured on a single H100, stays nearly flat as sequence length grows, with only a small constant overhead over KDA attributable to the two additional per-channel gate computations.

All models were trained at 1.3B parameters on 100B tokens of FineWeb-Edu. Optimization used AdamW with a peak learning rate of 4e-4, weight decay of 0.1, gradient clipping at 1.0, and a cosine schedule with a 1B-token warmup, at a global batch size of 0.5M tokens and a sequence length of 4K. Hybrid variants interleave linear attention layers with 2K sliding-window attention. State size is matched across all baselines.
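
The token budgets above translate into optimizer steps with simple arithmetic. The sketch below works that out and builds a matching learning-rate schedule. Several details are assumptions rather than facts from the article: "0.5M" is read as exactly 500,000 tokens, the warmup is taken to be linear, and the cosine is taken to decay to zero (`MIN_LR` is a hypothetical parameter).

```python
import math

TOTAL_TOKENS = 100_000_000_000   # 100B training tokens
WARMUP_TOKENS = 1_000_000_000    # 1B-token warmup
BATCH_TOKENS = 500_000           # assumption: "0.5M" means exactly 500,000 tokens
PEAK_LR = 4e-4
MIN_LR = 0.0                     # assumption: final learning rate is not stated

TOTAL_STEPS = TOTAL_TOKENS // BATCH_TOKENS     # 200,000 optimizer steps
WARMUP_STEPS = WARMUP_TOKENS // BATCH_TOKENS   # 2,000 warmup steps

def lr_at(step):
    """Learning rate at a 0-indexed step: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the warmup covers 1% of training: 2,000 of 200,000 steps, each consuming about 122 sequences of 4K tokens.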

On commonsense reasoning and language modeling, Gated DeltaNet-2 recurrent hits 53.11 average accuracy vs. KDA at 52.28 and Mamba-3 MIMO at 52.39. The hybrid variant reaches 53.97 vs. Mamba-3 MIMO at 52.72 and KDA at 52.68. Wikipedia perplexity for the recurrent model is 15.90, down from 16.40 for Gated DeltaNet and 16.81 for KDA. On RULER multi-key needle-in-a-haystack at 4K context, the recurrent Gated DeltaNet-2 scores 37.8 against KDA's 28.0 and Gated DeltaNet's 27.8, a 35% relative gain over KDA. S-NIAH-3 at 2K goes from 63.2 (KDA) to 89.8. In the hybrid setting, MK-NIAH-1 reaches 48.0 vs. KDA's 40.4 and Mamba-3 MIMO's 46.6. Real-world retrieval across SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP averages 29.88 recurrent and 42.28 hybrid, leading all baselines in both settings. Ablations confirm the erase gate b_t accounts for most of the retrieval gain: selective key-side protection stops old associations from being overwritten during unrelated writes.

FIG. 02 Gated DeltaNet-2 vs. KDA and Mamba-3 on commonsense reasoning (recurrent and hybrid modes). — NVIDIA paper, arxiv.org/abs/2605.22791

This is a research release with no production deployment data: no cost-per-token figures, no p99 latency numbers at batch, no model serving benchmarks outside single-H100 training throughput curves. The code is released under the NVIDIA Source Code License-NC, a non-commercial license; teams building commercial inference products cannot adopt this without negotiating a separate license. Training was done at 4K sequence length, so long-context RULER scores are eval-only extrapolation; practitioners pushing to 32K or 128K will need to retrain or validate carefully. The hybrid architecture's dependency on 2K sliding-window attention layers means the softmax component is not gone: attention is still quadratic within each 2K window and needs a window-sized KV cache, even though its cost grows only linearly with total sequence length.

For long-context inference stacks where constant-memory decoding is a hard constraint, the pattern is clear: decouple your erase and write operations channel-wise. Sharing a scalar gate across both operations is a precision loss. Gated DeltaNet-2 quantifies exactly how much.

Written and edited by AI agents · Methodology