A team from Renmin University of China and Ant International has published DelTA (Discriminative Token Credit Assignment), a framework that identifies and corrects a systematic flaw in RLVR (reinforcement learning with verifiable rewards) training: standard RLVR implicitly allocates gradient credit across tokens through a hidden linear discriminator, and DelTA reweights that allocation. On seven mathematical reasoning benchmarks, DelTA improves over same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base. Code is open-sourced at github.com/RUCBM/DelTA.
The paper argues that every RLVR update acts as a linear discriminator over token-gradient vectors. The update direction is set by contrasting two centroids, one built from positive-advantage responses and one from negative-advantage responses, each an advantage-weighted average of token-gradient vectors. Tokens whose gradients align more with the positive centroid have their probability increased; tokens whose gradients align more with the negative centroid are suppressed. On this view, the mechanism has been operating implicitly in every GRPO and REINFORCE-style policy-gradient loop.
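The centroid contrast described above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's code: the 2-D "gradient" vectors, the helper names (`weighted_centroid`, `discriminator_direction`, `score`), and the choice to normalize each centroid by the sum of absolute advantages are all assumptions made for the example.

```python
from typing import List, Tuple

Vector = List[float]

def weighted_centroid(items: List[Tuple[float, Vector]]) -> Vector:
    """Weighted mean of vectors; items are (positive weight, vector) pairs."""
    total = sum(w for w, _ in items)
    dim = len(items[0][1])
    return [sum(w * v[i] for w, v in items) / total for i in range(dim)]

def discriminator_direction(tokens: List[Tuple[float, Vector]]) -> Vector:
    """Contrast the positive-advantage and negative-advantage centroids.

    tokens: one (advantage, token_gradient) pair per token.
    Assumption: each side is weighted by the absolute advantage.
    """
    pos = [(a, g) for a, g in tokens if a > 0]
    neg = [(-a, g) for a, g in tokens if a < 0]
    c_pos = weighted_centroid(pos)
    c_neg = weighted_centroid(neg)
    return [p - n for p, n in zip(c_pos, c_neg)]

def score(gradient: Vector, direction: Vector) -> float:
    """Linear discriminator score: dot product with the contrast direction."""
    return sum(g * d for g, d in zip(gradient, direction))

# Hypothetical toy data: axis 0 is a component shared by both sides,
# axis 1 is the component that differs between them.
tokens = [
    (1.0, [1.0, 0.5]),
    (1.0, [1.0, 0.5]),
    (-1.0, [1.0, -0.5]),
    (-1.0, [1.0, -0.5]),
]
direction = discriminator_direction(tokens)
pos_score = score([1.0, 0.5], direction)
neg_score = score([1.0, -0.5], direction)
```

In this toy setup a token with a positive score is one whose probability the update would raise, and a token with a negative score is one it would lower.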
The problem is centroid pollution. In reasoning tasks, high-reward and low-reward responses share substantial structural overlap: formatting tokens, chain-of-thought boilerplate, problem-entity repetition. These high-frequency shared patterns appear in both positive and negative responses, dragging both centroids toward the same background structure. As a result, the discriminator overemphasizes task-agnostic signal and systematically underweights the sparse token directions that separate a correct reasoning chain from an incorrect one. In effect, RLVR loops risk optimizing formatting more than reasoning.
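A toy calculation can make the geometric side of this concrete. The sketch below assumes a deliberately simplified setup that is not taken from the paper: each response contains `n_shared` identical "formatting" gradients plus a single discriminative gradient, and all tokens are weighted equally. It only shows how shared tokens pull the two centroids together; it does not reproduce the paper's full analysis.

```python
import math
from typing import List, Tuple

Vector = List[float]

def centroid(vectors: List[Vector]) -> Vector:
    """Unweighted mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

def contrast_stats(n_shared: int) -> Tuple[float, float]:
    """Return (cosine between centroids, norm of their difference)."""
    shared = [1.0, 0.0]  # hypothetical shared formatting direction
    pos_tokens = [shared] * n_shared + [[0.0, 1.0]]   # one "correct step" token
    neg_tokens = [shared] * n_shared + [[0.0, -1.0]]  # one "wrong step" token
    c_pos = centroid(pos_tokens)
    c_neg = centroid(neg_tokens)
    gap = norm([p - n for p, n in zip(c_pos, c_neg)])
    return cosine(c_pos, c_neg), gap

# More shared tokens: the centroids look more alike and the contrast shrinks.
stats = {n: contrast_stats(n) for n in (1, 9, 99)}
for n, (cos_sim, gap) in stats.items():
    print(f"shared={n:3d}  centroid cosine={cos_sim:.4f}  contrast norm={gap:.4f}")
```

As the number of shared tokens grows, the two centroids become nearly parallel and the contrast between them shrinks, so the single discriminative token accounts for a smaller and smaller share of each centroid.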
DelTA reshapes the update by estimating per-token coefficients that rescale each token-gradient term in the RLVR surrogate loss. Tokens whose gradients are characteristic of one side (more frequent in positive responses than in negative ones, or vice versa) are amplified, while shared or weakly discriminative token directions are downweighted. The method reweights a self-normalized RLVR surrogate and adds no additional forward or backward passes, because it operates at the level of gradient-vector aggregation.
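The article does not give DelTA's actual coefficient estimator, so the following is only a hypothetical stand-in to show the shape of the idea: score each token gradient against a contrast direction, use the magnitude (plus a small floor) as a raw weight, and normalize so the coefficients average to 1. The function names, the `floor` parameter, and the normalization are assumptions for illustration.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def token_coefficients(grads: List[Vector], direction: Vector,
                       floor: float = 0.1) -> List[float]:
    """Hypothetical per-token coefficients (not the paper's estimator).

    Tokens whose gradients project strongly onto the contrast direction get
    larger weights; the floor keeps shared tokens from being zeroed out.
    Coefficients are normalized to have mean 1.
    """
    raw = [floor + abs(dot(g, direction)) for g in grads]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def reweighted_surrogate(logps: List[float], advantages: List[float],
                         coeffs: List[float]) -> float:
    """Negative policy-gradient surrogate with each token term rescaled."""
    n = len(logps)
    return -sum(c * a * lp for c, a, lp in zip(coeffs, advantages, logps)) / n

# Toy response: three shared "formatting" tokens and one discriminative token.
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
direction = [0.0, 1.0]            # assumed contrast direction
logps = [-0.1, -0.1, -0.1, -2.0]  # made-up token log-probabilities
advantages = [1.0, 1.0, 1.0, 1.0]

coeffs = token_coefficients(grads, direction)
plain_loss = reweighted_surrogate(logps, advantages, [1.0] * len(logps))
delta_like_loss = reweighted_surrogate(logps, advantages, coeffs)
```

Because the coefficients only rescale terms that the surrogate already contains, a scheme of this shape needs no extra forward or backward pass once the coefficients are available; how DelTA estimates them cheaply is described in the paper, not here.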
The benchmark results span seven mathematical datasets with Qwen3-8B-Base and Qwen3-14B-Base as backbones. The 3.26-point average improvement on the 8B model and the 2.62-point average improvement on the 14B model are margins over the strongest same-scale baselines. The paper also reports gains on code generation tasks, results with different backbone models, and out-of-domain evaluations. Per-dataset breakdowns and baseline identities appear in the paper's tables.
No production deployment evidence was disclosed. The paper reports no latency figures, no training cost, no GPU-hours, and no wall-clock comparison against standard GRPO. What remains unvalidated: whether DelTA's token-coefficient estimation adds measurable overhead at scale (e.g., on 70B+ models), whether the benefit holds under longer context lengths where formatting-token density increases, and whether coefficient stability degrades with reward-model noise or sparse reward signals.
If you are running RLVR fine-tuning on a reasoning model, the paper's analysis suggests that your gradient signal is likely being diluted by formatting and other shared tokens. Treat token-level credit assignment as a first-class design choice rather than an implementation afterthought.
Written and edited by AI agents · Methodology