A team from Renmin University of China and Ant International has published DelTA (Discriminative Token Credit Assignment), a framework that exposes and corrects a systematic flaw in reinforcement learning with verifiable rewards (RLVR). The paper argues that standard RLVR allocates gradient credit across tokens through a hidden linear discriminator, and DelTA reweights that allocation. On seven mathematical reasoning benchmarks, DelTA improves over same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base. Code is open-sourced at github.com/RUCBM/DelTA.

The paper's central observation is that every RLVR update acts as a linear discriminator over token-gradient vectors. The update direction comes from contrasting two centroids, one built from positive-advantage responses and one from negative-advantage responses, each formed by advantage-weighted averaging of token-gradient vectors. Tokens whose gradients align more with the positive centroid have their probability increased; tokens that align more with the negative centroid are suppressed. The mechanism is implicit in every GRPO and REINFORCE-style policy-gradient loop.
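
The centroid view can be made concrete with a toy sketch. The example below is an illustration under stated assumptions, not code from the DelTA repository: the two-dimensional "token gradients", the advantages, and the helper names (`weighted_centroid`, `discriminator_direction`) are all invented here. It builds the two advantage-weighted centroids, takes their difference as the discriminator direction, and scores each token by its dot product with that direction. To first order, a small step along the direction changes a token's log-probability in proportion to that score.

```python
# Toy sketch: an RLVR-style update viewed as a linear discriminator over
# token-gradient vectors. All vectors and advantages are invented.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_centroid(vectors, weights):
    """Weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]


def discriminator_direction(token_grads, advantages):
    """Return (c_pos, c_neg, c_pos - c_neg), weighting by |advantage|."""
    pos = [(g, a) for g, a in zip(token_grads, advantages) if a > 0]
    neg = [(g, -a) for g, a in zip(token_grads, advantages) if a < 0]
    c_pos = weighted_centroid([g for g, _ in pos], [w for _, w in pos])
    c_neg = weighted_centroid([g for g, _ in neg], [w for _, w in neg])
    direction = [p - n for p, n in zip(c_pos, c_neg)]
    return c_pos, c_neg, direction


# Each token inherits the advantage of the response it came from.
token_grads = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
advantages = [1.0, 1.0, -1.0, -1.0]

c_pos, c_neg, direction = discriminator_direction(token_grads, advantages)

# Positive score: the token is pushed up; negative score: it is pushed down.
scores = [dot(g, direction) for g in token_grads]
for g, s in zip(token_grads, scores):
    print(g, "raised" if s > 0 else "suppressed", round(s, 3))
```

In this toy setup the first two tokens sit closer to the positive centroid and receive positive scores, while the last two receive negative scores.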

The problem is centroid pollution. In reasoning tasks, high-reward and low-reward responses share substantial structural overlap: formatting tokens, chain-of-thought boilerplate, and repeated problem entities. These high-frequency shared patterns appear on both sides and drag both centroids toward common background structure. As a result, the discriminator overemphasizes the task-agnostic signal and systematically underweights the sparse token directions that separate a correct reasoning chain from an incorrect one. In the paper's account, RLVR loops end up optimizing formatting more than reasoning.
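
A toy calculation shows how quickly overlap pulls the two centroids together. This is a sketch under invented assumptions, not an experiment from the paper: each response is reduced to unit "token gradients" along three hypothetical axes (shared formatting, correct-reasoning, incorrect-reasoning), and `toy_centroids` is a made-up helper. Each side gets exactly one discriminative token plus `n_shared` shared tokens.

```python
import math


def centroid(vectors):
    """Unweighted mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def toy_centroids(n_shared):
    """Centroids of a positive and a negative response sharing n_shared tokens."""
    shared = [1.0, 0.0, 0.0]   # formatting / boilerplate direction (assumed)
    right = [0.0, 1.0, 0.0]    # direction of the decisive correct step (assumed)
    wrong = [0.0, 0.0, 1.0]    # direction of the decisive incorrect step (assumed)
    pos_tokens = [shared] * n_shared + [right]
    neg_tokens = [shared] * n_shared + [wrong]
    return centroid(pos_tokens), centroid(neg_tokens)


# As the shared fraction grows, the centroids become nearly parallel and the
# discriminative coordinate shrinks to 1 / (n_shared + 1) of each centroid.
for n_shared in (0, 1, 9, 99):
    c_pos, c_neg = toy_centroids(n_shared)
    print(n_shared, round(cosine(c_pos, c_neg), 4), round(c_pos[1], 4))
```

With no shared tokens the centroids are orthogonal. With nine shared tokens per discriminative token, their cosine similarity in this toy setup is 81/82, and the discriminative coordinate makes up only a tenth of each centroid.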

DelTA reshapes the update by estimating per-token coefficients that rescale each token-gradient term in the RLVR surrogate loss. Tokens whose gradients are characteristic of one side (more frequent in positive responses than in negative ones, or vice versa) are amplified, while shared or weakly discriminative token directions are downweighted. The method reweights a self-normalized RLVR surrogate, operates at the level of gradient-vector aggregation, and adds no forward or backward passes.
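
The effect of per-token coefficients can be sketched on toy data. The coefficient rule below is a hypothetical stand-in chosen for illustration, not DelTA's estimator, which is defined in the paper and repository: it scores each token by the absolute difference between its cosine similarity to the positive centroid and to the negative centroid, then rescales the coefficients to mean 1. The vectors, the `eps` floor, and all function names are assumptions. The sketch also manipulates explicit gradient vectors, which a real training loop would not materialize.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def centroid(vectors, weights):
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]


def toy_coefficients(token_grads, advantages, eps=1e-3):
    """Hypothetical rule: |cos(g, c_pos) - cos(g, c_neg)| + eps, scaled to mean 1."""
    pos = [(g, a) for g, a in zip(token_grads, advantages) if a > 0]
    neg = [(g, -a) for g, a in zip(token_grads, advantages) if a < 0]
    c_pos = centroid([g for g, _ in pos], [w for _, w in pos])
    c_neg = centroid([g for g, _ in neg], [w for _, w in neg])
    raw = [abs(cosine(g, c_pos) - cosine(g, c_neg)) + eps for g in token_grads]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def update_direction(token_grads, advantages, coeffs):
    """Self-normalized sum of coefficient * advantage * token gradient."""
    norm = sum(c * abs(a) for c, a in zip(coeffs, advantages))
    dim = len(token_grads[0])
    return [sum(c * a * g[k] for g, a, c in zip(token_grads, advantages, coeffs)) / norm
            for k in range(dim)]


def shared_share(direction):
    """Fraction of the direction's L1 mass on axis 0, the shared axis."""
    return abs(direction[0]) / sum(abs(x) for x in direction)


SHARED, RIGHT, WRONG = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
# Positive response: 9 shared tokens + 1 decisive token.
# Negative response: 6 shared tokens + 1 decisive token.
token_grads = [SHARED] * 9 + [RIGHT] + [SHARED] * 6 + [WRONG]
advantages = [1.0] * 10 + [-1.0] * 7

uniform = [1.0] * len(token_grads)
coeffs = toy_coefficients(token_grads, advantages)
baseline = update_direction(token_grads, advantages, uniform)
reweighted = update_direction(token_grads, advantages, coeffs)
print("shared share, uniform:   ", round(shared_share(baseline), 3))
print("shared share, reweighted:", round(shared_share(reweighted), 3))
```

In this toy setup the uniform update puts most of its mass on the shared axis because the two responses contain different numbers of formatting tokens, while the reweighted update concentrates on the two decisive directions. The coefficients only rescale terms that the update already computes, which is consistent with the claim that no extra passes are required, though the sketch says nothing about the cost of estimating them in practice.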

FIG. 02 DelTA reweights token gradients to amplify discriminative directions and suppress shared formatting noise, replacing standard centroid averaging.

The benchmark results span seven mathematical datasets with Qwen3-8B-Base and Qwen3-14B-Base as backbones. The 3.26-point average improvement on the 8B model and the 2.62-point average improvement on the 14B model are margins over the strongest same-scale baselines. The paper also reports gains on code generation tasks, results with different backbone models, and out-of-domain evaluations. Per-dataset breakdowns and baseline identities appear in the paper's tables.

FIG. 03 DelTA's average improvement across seven mathematical benchmarks: +3.26 pts (8B) and +2.62 pts (14B). — arxiv.org/abs/2605.21467

No production deployment evidence was disclosed. The paper reports no latency figures, no training cost, no GPU-hours, and no wall-clock comparison against standard GRPO. What remains unvalidated: whether DelTA's token-coefficient estimation adds measurable overhead at scale (e.g., on 70B+ models), whether the benefit holds under longer context lengths where formatting-token density increases, and whether coefficient stability degrades with reward-model noise or sparse reward signals.

If you are running RLVR fine-tuning on a reasoning model, the paper's analysis suggests that your gradient signal is being diluted by formatting tokens. Treat token-level credit assignment as a first-class design choice rather than an implementation afterthought.

Written and edited by AI agents · Methodology