A paper from Mila and Université de Montréal, published April 17, settles a core debate in applied LLM research: reinforcement learning with task-based rewards teaches models new skills — it does not merely concentrate probability mass on outputs the model already favored. The finding has direct consequences for how AI teams budget post-training compute and design reward pipelines.
The debate centers on the "distribution sharpening hypothesis," which holds that RL fine-tuning works by making a model more confident in its existing preferences rather than expanding its capability surface. Proponents point to evidence that inference-time procedures — requiring no training at all — can recover strong performance on reasoning benchmarks by concentrating probability on high-likelihood outputs. If true, the implication is that RL training is expensive redundancy: better calibration or smarter decoding would suffice.
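To make that concrete, here is a minimal sketch of what an inference-time sharpening baseline can look like: sample several candidates from the base model, then keep the one the model itself scores as most likely. The model name and prompt are illustrative placeholders, and the procedures sharpening proponents actually cite may score and select candidates differently; this is only meant to show the mechanism, not the paper's baselines.

```python
# Minimal sketch of inference-time "sharpening": sample N candidates from the base
# model, then rerank by the base model's own likelihood (no training involved).
# Model name and prompt are illustrative; real baselines may score candidates differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # one of the model families used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def completion_logprob(prompt_ids: torch.Tensor, full_ids: torch.Tensor) -> float:
    """Sum of log-probabilities the base model assigns to the completion tokens."""
    with torch.no_grad():
        logits = model(full_ids.unsqueeze(0)).logits[0]
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    total = 0.0
    # The token at position t is predicted by the logits at position t - 1.
    for t in range(prompt_ids.shape[0], full_ids.shape[0]):
        total += logprobs[t - 1, full_ids[t]].item()
        if full_ids[t].item() == tok.eos_token_id:  # stop before trailing padding
            break
    return total

prompt = "Solve: 12 * 7 + 5 = ?"
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]

# Sample several completions, then keep the one the model already favored most.
candidates = model.generate(
    prompt_ids.unsqueeze(0),
    do_sample=True,
    temperature=0.8,
    max_new_tokens=128,
    num_return_sequences=8,
    pad_token_id=tok.eos_token_id,
)
best = max(candidates, key=lambda ids: completion_logprob(prompt_ids, ids))
print(tok.decode(best[prompt_ids.shape[0]:], skip_special_tokens=True))
```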
Researchers Sarthak Mittal, Leo Gagnon, and Guillaume Lajoie stress-tested that hypothesis using a single unified framework. They used the standard KL-regularized RL objective — which balances a reward maximization term against a KL divergence penalty relative to a reference distribution — and varied each component's contribution to isolate four training regimes: pure task-reward optimization, distribution sharpening optimization (where the "reward" is simply log-probability under the base model), and two hybrid variants called Tilted Sampling and Tempered Sampling. Because all four methods share the same underlying training procedure, observed differences reflect the signal being optimized rather than artifacts of switching frameworks.
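In conventional notation (the paper's exact formulation and its definitions of Tilted and Tempered Sampling may differ), the shared objective and the sharpening special case look like the following; the well-known closed form of the optimum makes clear why optimizing the base model's own log-probability can only re-weight what the model already does.

```latex
% Standard KL-regularized RL objective (conventional rendering, not necessarily
% the paper's exact notation):
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% Its well-known optimum tilts the reference model by the exponentiated reward:
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left( r(x, y) / \beta \right)

% Sharpening-only regime: set the "reward" to the base model's own log-probability,
% r(x, y) = \log \pi_{\mathrm{ref}}(y \mid x). The optimum collapses to a tempered base model:
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)^{\,1 + 1/\beta}
```

Raising the base distribution to a power can only concentrate mass on outputs the base model already favors; a task reward, by contrast, can shift mass toward outputs the base model scored poorly but that actually solve the problem.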
The verdict against distribution sharpening is both theoretical and empirical. The paper demonstrates from first principles that the optima of sharpening-only objectives are unfavorable and that the optimization is unstable. Experiments across Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, and Qwen3-4B-Instruct on mathematical reasoning datasets confirm the theory: sharpening alone yields limited gains, while incorporating task-based reward signals produces stable performance improvements across all three model families.
For enterprise AI teams, the architectural implication is unambiguous: reward function design is not a secondary concern that can be deferred or approximated. If post-training gains came primarily from sharpening, organizations could trade RL pipelines for cheaper inference-time alternatives such as best-of-N sampling, smarter decoding strategies, or temperature sweeps. This paper closes that escape hatch. The quality of the task reward signal is a first-order input to capability uplift, not a detail to borrow from adjacent work.
The results also reframe how to interpret evaluation jumps after RL fine-tuning. Teams that attributed model improvements to better output formatting or increased confidence in existing reasoning patterns may need to revisit those attributions. The paper's framework provides a controlled diagnostic — running a sharpening-only baseline against a task-reward run on the same architecture and dataset yields a clean separation of the two effects.
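A minimal sketch of that diagnostic, assuming a generic KL-regularized trainer rather than the paper's code: hold the model, dataset, hyperparameters, and KL penalty fixed, and swap only the reward function. The names `rl_finetune` and `base_logprob_fn` are hypothetical placeholders for a team's existing trainer and a frozen copy of the base model's scoring function; neither comes from the paper.

```python
# Sharpening-vs-task-reward ablation: two runs that differ only in the optimized signal.

def task_reward(example: dict, completion: str) -> float:
    """Verifiable binary reward: 1.0 if the reference answer appears in the completion."""
    return 1.0 if example["answer"].strip() in completion else 0.0

def make_sharpening_reward(base_logprob_fn):
    """Sharpening-only 'reward': the frozen base model's own log-probability of the
    sampled completion, so optimization can only concentrate existing preferences."""
    def reward(example: dict, completion: str) -> float:
        return base_logprob_fn(example["prompt"], completion)
    return reward

# Identical architecture, dataset, hyperparameters, and KL penalty in both runs;
# any gap on held-out accuracy then cleanly separates the two effects.
# run_task  = rl_finetune(model, dataset, reward_fn=task_reward)
# run_sharp = rl_finetune(model, dataset, reward_fn=make_sharpening_reward(base_logprob_fn))
```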
Limitations remain. The experiments focus on mathematical reasoning, a domain with cheap, verifiable binary rewards. Tasks where the reward signal is noisy, delayed, or must come from human preference modeling, such as coding agents, multi-step tool use, and open-ended generation, may produce different trade-offs. Whether the instability finding for pure sharpening generalizes to larger parameter counts and longer training horizons has not yet been demonstrated. Those are the right experiments to run next.
The core verdict is straightforward: task rewards are doing real work. Reward engineering deserves the same rigor as architecture selection.