Researchers at Mila and Université de Montréal have published the first controlled comparison of task-reward reinforcement learning and distribution sharpening, finding that RL training instills new capabilities that inference-time techniques cannot replicate.
The paper, "Beyond Distribution Sharpening: The Importance of Task Rewards," by Sarthak Mittal, Leo Gagnon, and Guillaume Lajoie, enters a debate that has divided the post-training community for over a year. The distribution sharpening hypothesis holds that RLHF, GRPO, and related pipelines do not teach models anything new — they make the model more confident in outputs it already considered likely, an effect reproducible by inference-time techniques like beam search or temperature scaling. That would mean expensive RL pipelines are redundant and better decoding strategies could close the gap.
To test this, the authors built a unified KL-regularized RL framework expressing four distinct training objectives without changing the underlying training machinery: pure task-reward optimization, pure distribution sharpening optimization, tempered sampling (an inference-time sharpening baseline), and tilted sampling (a hybrid). Isolating the signal this way eliminates confounds from different optimizers or training setups — any performance difference is attributable to what is being optimized, not how. They ran experiments across mathematical reasoning benchmarks of varying difficulty using three instruction-tuned models: Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, and Qwen3-4B-Instruct-2507.
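The paper's exact parameterization isn't reproduced here, but the standard KL-regularized objective such a framework builds on is well known, along with its closed-form optimum:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[r(x,y)\right]
\;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
\quad\Longrightarrow\quad
\pi^{\star}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, e^{\,r(x,y)/\beta}
```

Under this formulation, choosing r as a verifiable task reward gives task-reward RL; choosing r(x, y) = log π_ref(y | x) makes the optimum a tempered power of π_ref, which is pure sharpening with no external signal; a weighted mix of the two yields a tilted objective; and tempering π_ref directly at inference time requires no training at all. The four conditions differ only in the choice of r, which is what makes the comparison clean.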
The results were unambiguous. Distribution sharpening, whether applied at training time or at inference time, delivered limited performance gains and proved fundamentally unstable. The paper proves from first principles that the optima of a sharpening objective can be unfavorable and that training under such an objective can diverge. Task-reward optimization, by contrast, produced consistent performance gains and stable learning curves across all three models. The team also showed theoretically that if RL gains were purely a sharpening artifact, they would plateau at a ceiling set by the quality of the pre-trained distribution, a ceiling that task-reward RL breaks through.
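That ceiling has a simple intuition, checkable on a toy example: when the reference model ranks a wrong answer first, sharpening concentrates mass on the wrong mode, while the closed-form optimum of KL-regularized task-reward RL moves mass onto the correct answer whenever it has nonzero support. A sketch with hypothetical numbers (not the paper's data):

```python
import numpy as np

# One prompt, five candidate answers; index 2 is correct but not the mode.
p_ref = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
correct = 2
reward = np.zeros(5)
reward[correct] = 1.0  # binary verifiable reward

def temper(p, tau):
    """Sharpen: renormalize p**(1/tau); tau < 1 concentrates on the mode."""
    q = p ** (1.0 / tau)
    return q / q.sum()

def kl_rl_optimum(p, r, beta):
    """Closed-form optimum of KL-regularized RL: pi*(y) ∝ p(y) exp(r(y)/beta)."""
    q = p * np.exp(r / beta)
    return q / q.sum()

for tau in (1.0, 0.5, 0.1):
    print(f"sharpened   tau={tau}:  P(correct) = {temper(p_ref, tau)[correct]:.3f}")
for beta in (1.0, 0.2):
    print(f"task-reward beta={beta}: P(correct) = {kl_rl_optimum(p_ref, reward, beta)[correct]:.3f}")
```

Sharpening drives P(correct) from 0.20 toward zero as the temperature drops, while the task-reward optimum pushes it toward one as the KL penalty β shrinks: the sharpening ceiling is set by what π_ref already ranks highly.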
For enterprise teams weighing the build-vs-buy decision on RL infrastructure, this paper closes a meaningful loophole. The argument that prompt engineering, best-of-N sampling, or low-temperature decoding could substitute for a full RLHF or GRPO training stack now has a controlled empirical rebuttal. The implication is direct: if you need a model that performs reliably on multi-step reasoning, tool use, or domain-specific planning at production quality, you need task-reward training — not just a better inference wrapper around a base model.
The experiments use 3B–4B parameter instruction-tuned models on math tasks, a domain where reward signals are unambiguous (answers are right or wrong). Whether the findings generalize to larger models, to noisier reward signals such as human preference labels or LLM-as-judge scores, or to non-mathematical domains remains untested. The authors also restrict themselves to the GRPO-style KL-regularized family of objectives; process reward models and Monte Carlo tree search are outside the scope.
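"Unambiguous" here means the reward is mechanically checkable. A minimal sketch of such a grader (hypothetical extraction rule, not the paper's code):

```python
import re

def math_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the model's final answer matches
    the gold answer. Real graders also normalize fractions, units, etc."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = m.group(1) if m else completion.split()[-1]
    return 1.0 if predicted.strip() == gold.strip() else 0.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(math_reward("I think it is 41", "42"))                  # 0.0
```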
The timing is pointed. Several influential voices over the past year have argued that inference-time scaling (chain-of-thought sampling, majority voting, or best-of-N generation) is a sufficient substitute for post-training RL, and that the industry has over-invested in training pipelines. This paper, while not the final word, offers the most methodologically clean evidence to date that those arguments underestimate what task-reward RL does to a model's capability surface.
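Those inference-time recipes are easy to state precisely: draw N samples, then either keep the one a scorer rates highest (best-of-N) or return the most common extracted answer (majority voting, also known as self-consistency). A generic sketch, with `generate`, `score`, and `extract_answer` as assumed callables:

```python
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float], n: int) -> str:
    """Draw n completions; keep the one the scorer rates highest.
    No weights change; this only selects among existing samples."""
    return max((generate() for _ in range(n)), key=score)

def majority_vote(generate: Callable[[], str],
                  extract_answer: Callable[[str], str], n: int) -> str:
    """Draw n completions; return the most frequent final answer."""
    votes = Counter(extract_answer(generate()) for _ in range(n))
    return votes.most_common(1)[0][0]
```

Both select among samples the base model already produces rather than changing its weights, which is why they fall under the sharpening umbrella the paper tests.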
The design of reward signals, the Mila team concludes, remains a central component of capability scaling — not an implementation detail that clever decoding can route around.