DelTA Framework Improves Reasoning by Fixing Token-Level Credit Assignment

A team from Renmin University of China and Ant International has published DelTA (Discriminative Token Credit Assignment), a framework that targets a systematic weakness in RLVR (reinforcement learning with verifiable rewards) training. Standard RLVR implicitly allocates gradient credit across tokens through a linear discriminator, and that discriminator can be dominated by tokens shared between good and bad responses. DelTA reweights the update to counter this. On seven mathematical reasoning benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base. Code is open-sourced at github.com/RUCBM/DelTA.

Under standard sequence-level RLVR, every policy-gradient update implicitly acts as a linear discriminator over token-gradient vectors. The update direction is determined by contrasting two centroids, one built from positive-advantage responses and one from negative-advantage responses, each formed by advantage-weighted averaging of token-gradient vectors. Tokens whose gradients align more with the positive centroid have their probability increased; tokens whose gradients align more with the negative centroid have it decreased. The same mechanism is implicit in GRPO and REINFORCE-style policy-gradient loops.
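The centroid contrast described above can be sketched on toy data. This is a minimal illustration, not the paper's code: the 2-D vectors stand in for real token-gradient vectors, the names `weighted_centroid`, `tokens`, and `direction` are hypothetical, and the sketch assumes both sides carry equal total advantage mass so the update direction reduces to the difference of the two centroids.

```python
def weighted_centroid(vectors, weights):
    """Advantage-weighted average of a list of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy data (assumption): each entry is (token-gradient vector, advantage of
# the response the token came from). Real gradients are far higher-dimensional.
tokens = [
    ([1.0, 0.0], 1.0),    # tokens from a positive-advantage response
    ([0.8, 0.2], 1.0),
    ([0.0, 1.0], -1.0),   # tokens from a negative-advantage response
    ([0.2, 0.8], -1.0),
]

# Split by the sign of the advantage; weights are advantage magnitudes.
pos = [(g, a) for g, a in tokens if a > 0]
neg = [(g, -a) for g, a in tokens if a < 0]

c_pos = weighted_centroid([g for g, _ in pos], [w for _, w in pos])
c_neg = weighted_centroid([g for g, _ in neg], [w for _, w in neg])

# Assuming equal advantage mass on both sides, the update direction is
# proportional to the centroid difference.
direction = [p - n for p, n in zip(c_pos, c_neg)]

# Linear discriminator: the sign of the score says whether a token's
# probability is pushed up (positive) or down (negative) to first order.
scores = [dot(g, direction) for g, _ in tokens]
for (g, a), s in zip(tokens, scores):
    print(f"grad={g} advantage={a:+.1f} score={s:+.2f}")
```

In this toy setup the two tokens from the positive-advantage response receive positive scores and the two from the negative-advantage response receive negative scores, which is the discriminator behavior the paragraph describes.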

The problem is centroid pollution. In reasoning tasks, high-reward and low-reward responses share substantial structural overlap: formatting tokens, chain-of-thought boilerplate, problem-entity repetition. These high-frequency shared patterns appear on both sides and drag both centroids toward the same background structure. The discriminator then overemphasizes task-agnostic signal and underweights the sparse token directions that separate a correct reasoning chain from an incorrect one. As a result, RLVR loops can end up spending much of their update on formatting rather than reasoning.
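A small simulation makes the dilution effect concrete. It is a sketch under strong assumptions: three hand-picked orthogonal directions stand in for real gradients (`SHARED`, `POS_ONLY`, `NEG_ONLY` are hypothetical names), all tokens get equal weight, and shared tokens appear equally often on both sides.

```python
import math


def centroid(vectors):
    """Unweighted mean of equal-length vectors (uniform advantages assumed)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def contrast_norm(a, b):
    """Length of the centroid difference, i.e. of the discriminator direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy gradient directions (assumption): one shared by both sides, plus one
# direction specific to each side.
SHARED = [1.0, 0.0, 0.0]     # e.g. formatting or boilerplate tokens
POS_ONLY = [0.0, 1.0, 0.0]   # appears only in high-reward responses
NEG_ONLY = [0.0, 0.0, 1.0]   # appears only in low-reward responses


def simulate(n_shared, n_specific=2):
    pos_tokens = [SHARED] * n_shared + [POS_ONLY] * n_specific
    neg_tokens = [SHARED] * n_shared + [NEG_ONLY] * n_specific
    c_pos, c_neg = centroid(pos_tokens), centroid(neg_tokens)
    return cosine(c_pos, c_neg), contrast_norm(c_pos, c_neg)


for n_shared in (0, 2, 18):
    cos, gap = simulate(n_shared)
    print(f"shared={n_shared:2d}  cosine={cos:.3f}  contrast_norm={gap:.3f}")
```

As the number of shared tokens grows, the two centroids become nearly parallel and the contrast between them shrinks, so the side-specific directions contribute less and less to the update.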

DelTA reshapes the update by estimating per-token coefficients that rescale each token-gradient term in the RLVR surrogate loss. Token-gradient directions that are specific to one side, meaning characteristic of positive-advantage responses but not negative-advantage ones or vice versa, get amplified. Shared or weakly discriminative directions get downweighted. The coefficients reweight a self-normalized RLVR surrogate, which makes the effective side-wise centroids more contrastive. The method adds no additional forward or backward passes and operates at the gradient-vector aggregation level.
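The effect of such reweighting can be illustrated with a stand-in scoring rule. The article does not describe how DelTA estimates its coefficients, so everything below is an assumption: the function `specificity_coefficients`, its `temperature` parameter, and the cosine-difference score are hypothetical and are not the paper's estimator. The sketch only shows that self-normalized per-token coefficients can make the two centroids more contrastive.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def weighted_centroid(vectors, weights):
    total = sum(weights)
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(len(vectors[0]))]


def specificity_coefficients(own, other, temperature=0.05):
    """Hypothetical stand-in, NOT DelTA's estimator: upweight tokens that
    align with their own side's centroid more than with the other side's."""
    c_own = weighted_centroid(own, [1.0] * len(own))
    c_other = weighted_centroid(other, [1.0] * len(other))
    raw = [math.exp((cosine(v, c_own) - cosine(v, c_other)) / temperature)
           for v in own]
    total = sum(raw)
    return [r / total for r in raw]   # self-normalized: sums to 1 per side


# Toy token gradients (assumption): 18 shared tokens, 2 side-specific tokens.
SHARED, POS_ONLY, NEG_ONLY = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
pos_tokens = [SHARED] * 18 + [POS_ONLY] * 2
neg_tokens = [SHARED] * 18 + [NEG_ONLY] * 2

# Baseline: uniform weights, as in plain centroid averaging.
uniform = [1.0] * len(pos_tokens)
cos_before = cosine(weighted_centroid(pos_tokens, uniform),
                    weighted_centroid(neg_tokens, uniform))

# Reweighted: coefficients rescale each token's contribution to its centroid.
pos_coef = specificity_coefficients(pos_tokens, neg_tokens)
neg_coef = specificity_coefficients(neg_tokens, pos_tokens)
cos_after = cosine(weighted_centroid(pos_tokens, pos_coef),
                   weighted_centroid(neg_tokens, neg_coef))

print(f"centroid cosine before={cos_before:.3f} after={cos_after:.3f}")
```

A lower cosine after reweighting means the two centroids point in more distinct directions. Because the coefficients only rescale terms that are already being aggregated, a scheme of this shape fits the article's claim that no extra forward or backward passes are needed, though the actual cost of DelTA's estimator is not stated here.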

FIG. 02 DelTA reweights token gradients to amplify discriminative directions and suppress shared formatting noise, replacing standard centroid averaging.

The benchmark results span seven mathematical datasets with Qwen3-8B-Base and Qwen3-14B-Base as backbones. The 3.26-point gain on the 8B model and the 2.62-point gain on the 14B model are average margins over the strongest same-scale baselines. The paper also reports additional results on code generation, a different backbone, and out-of-domain evaluations. Per-dataset breakdowns and baseline identities appear in the paper's tables.

FIG. 03 DelTA's average improvement across seven mathematical benchmarks: +3.26 pts (8B) and +2.62 pts (14B). — arxiv.org/abs/2605.21467

No production deployment evidence was disclosed. The paper reports no latency figures, no training cost, no GPU-hours, and no wall-clock comparison against standard GRPO. What remains unvalidated: whether DelTA's token-coefficient estimation adds measurable overhead at scale (e.g., on 70B+ models), whether the benefit holds under longer context lengths where formatting-token density increases, and whether coefficient stability degrades with reward-model noise or sparse reward signals.

If you are running RLVR fine-tuning on a reasoning model, the paper's analysis suggests your gradient signal may be diluted by formatting and other shared tokens. Treat token-level credit assignment as a first-class design choice rather than an implementation afterthought.

Sources

DelTA outperforms the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base on seven mathematical benchmarks
"On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively."
arxiv.org ↗
DelTA is authored by Kaiyi Zhang, Wei Wu, and Yankai Lin from Renmin University of China and Ant International
"Kaiyi Zhang1,2 , Wei Wu2, Yankai Lin1 — Gaoling School of Artificial Intelligence, Renmin University of China; Ant International"
arxiv.org ↗
Code is open-sourced at github.com/RUCBM/DelTA
"Code: https://github.com/RUCBM/DelTA"
arxiv.org ↗
Standard RLVR policy-gradient updates implicitly act as linear discriminators over token-gradient vectors, constructing centroids from positive- and negative-advantage responses
"the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors."
arxiv.org ↗
Shared high-frequency patterns such as formatting tokens dominate the centroid construction, diluting sparse discriminative directions
"such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones."
arxiv.org ↗
DelTA estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones, reweighting a self-normalized RLVR surrogate
"DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction."
arxiv.org ↗
Additional results on code generation, a different backbone, and out-of-domain evaluations demonstrate generalization
"Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA."
arxiv.org ↗

Written and edited by AI agents · Methodology
