Researchers at Mila and Université de Montréal have published the first controlled comparison of task-reward reinforcement learning and distribution sharpening, finding that RL training instills new capabilities that inference-time techniques cannot replicate.
The paper, "Beyond Distribution Sharpening: The Importance of Task Rewards," by Sarthak Mittal, Leo Gagnon, and Guillaume Lajoie, enters a debate that has divided the post-training community for over a year. The distribution sharpening hypothesis holds that RLHF, GRPO, and related pipelines do not teach models anything new — they make the model more confident in outputs it already considered likely, an effect reproducible by inference-time techniques like beam search or temperature scaling. That would mean expensive RL pipelines are redundant and better decoding strategies could close the gap.
To test this, the authors built a unified KL-regularized RL framework expressing four distinct training objectives without changing the underlying training machinery: pure task-reward optimization, pure distribution sharpening optimization, tempered sampling (an inference-time sharpening baseline), and tilted sampling (a hybrid). Isolating the signal this way eliminates confounds from different optimizers or training setups — any performance difference is attributable to what is being optimized, not how. They ran experiments across mathematical reasoning benchmarks of varying difficulty using three instruction-tuned models: Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, and Qwen3-4B-Instruct-2507.
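The paper's exact parameterization isn't reproduced here, but the standard KL-regularized objective such a framework builds on is well known, along with its closed-form optimum:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[r(x,y)\right]
\;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
\quad\Longrightarrow\quad
\pi^{\star}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, e^{\,r(x,y)/\beta}
```

Under this formulation, choosing r as a verifiable task reward gives task-reward RL; choosing r(x, y) = log π_ref(y | x) makes the optimum a tempered power of π_ref, which is pure sharpening with no external signal; a weighted mix of the two yields a tilted objective; and tempering π_ref directly at inference time requires no training at all. The four conditions differ only in the choice of r, which is what makes the comparison clean.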
The results were unambiguous. Distribution sharpening, whether applied at training time or at inference time, delivered limited performance gains and proved fundamentally unstable. The paper proves from first principles that the optima of a sharpening objective can be unfavorable and that training under such an objective can diverge. Task-reward optimization, by contrast, produced consistent performance gains and stable learning curves across all three models. The team also showed theoretically that if RL gains were purely a sharpening artifact, they would plateau at a ceiling set by the quality of the pre-trained distribution, a ceiling that task-reward RL breaks through.
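That ceiling has a simple intuition, checkable on a toy example: when the reference model ranks a wrong answer first, sharpening concentrates mass on the wrong mode, while the closed-form optimum of KL-regularized task-reward RL moves mass onto the correct answer whenever it has nonzero support. A sketch with hypothetical numbers (not the paper's data):

```python
import numpy as np

# One prompt, five candidate answers; index 2 is correct but not the mode.
p_ref = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
correct = 2
reward = np.zeros(5)
reward[correct] = 1.0  # binary verifiable reward

def temper(p, tau):
    """Sharpen: renormalize p**(1/tau); tau < 1 concentrates on the mode."""
    q = p ** (1.0 / tau)
    return q / q.sum()

def kl_rl_optimum(p, r, beta):
    """Closed-form optimum of KL-regularized RL: pi*(y) ∝ p(y) exp(r(y)/beta)."""
    q = p * np.exp(r / beta)
    return q / q.sum()

for tau in (1.0, 0.5, 0.1):
    print(f"sharpened   tau={tau}:  P(correct) = {temper(p_ref, tau)[correct]:.3f}")
for beta in (1.0, 0.2):
    print(f"task-reward beta={beta}: P(correct) = {kl_rl_optimum(p_ref, reward, beta)[correct]:.3f}")
```

Sharpening drives P(correct) from 0.20 toward zero as the temperature drops, while the task-reward optimum pushes it toward one as the KL penalty β shrinks: the sharpening ceiling is set by what π_ref already ranks highly.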
For enterprise teams weighing the build-vs-buy decision on RL infrastructure, this paper closes a meaningful loophole. The argument that prompt engineering, best-of-N sampling, or low-temperature decoding could substitute for a full RLHF or GRPO training stack now has a controlled empirical rebuttal. The implication is direct: if you need a model that performs reliably on multi-step reasoning, tool use, or domain-specific planning at production quality, you need task-reward training — not just a better inference wrapper around a base model.
The experiments use 3B–4B parameter instruction-tuned models on math tasks, a domain where reward signals are unambiguous (answers are right or wrong). Whether the findings generalize to larger models, to noisier reward signals such as human preference labels or LLM-as-judge scores, or to non-mathematical domains remains untested. The authors also restrict themselves to the GRPO-style KL-regularized family of objectives; process reward models and Monte Carlo tree search are outside the scope.
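"Unambiguous" here means the reward is mechanically checkable. A minimal sketch of such a grader (hypothetical extraction rule, not the paper's code):

```python
import re

def math_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the model's final answer matches
    the gold answer. Real graders also normalize fractions, units, etc."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = m.group(1) if m else completion.split()[-1]
    return 1.0 if predicted.strip() == gold.strip() else 0.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(math_reward("I think it is 41", "42"))                  # 0.0
```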
The timing is pointed. Several influential voices over the past year have argued that inference-time scaling (chain-of-thought sampling, majority voting, or best-of-N generation) is a sufficient substitute for post-training RL, and that the industry has over-invested in training pipelines. This paper, while not the final word, offers the most methodologically clean evidence to date that those arguments underestimate what task-reward RL does to a model's capability surface.
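Those inference-time recipes are easy to state precisely: draw N samples, then either keep the one a scorer rates highest (best-of-N) or return the most common extracted answer (majority voting, also known as self-consistency). A generic sketch, with `generate`, `score`, and `extract_answer` as assumed callables:

```python
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float], n: int) -> str:
    """Draw n completions; keep the one the scorer rates highest.
    No weights change; this only selects among existing samples."""
    return max((generate() for _ in range(n)), key=score)

def majority_vote(generate: Callable[[], str],
                  extract_answer: Callable[[str], str], n: int) -> str:
    """Draw n completions; return the most frequent final answer."""
    votes = Counter(extract_answer(generate()) for _ in range(n))
    return votes.most_common(1)[0][0]
```

Both select among samples the base model already produces rather than changing its weights, which is why they fall under the sharpening umbrella the paper tests.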
The design of reward signals, the Mila team concludes, remains a central component of capability scaling — not an implementation detail that clever decoding can route around.