A paper from Mila and Université de Montréal, published April 17, settles a core debate in applied LLM research: reinforcement learning with task-based rewards teaches models new skills — it does not merely concentrate probability mass on outputs the model already favored. The finding has direct consequences for how AI teams budget post-training compute and design reward pipelines.
The debate centers on the "distribution sharpening hypothesis," which holds that RL fine-tuning works by making a model more confident in its existing preferences rather than expanding its capability surface. Proponents point to evidence that inference-time procedures — requiring no training at all — can recover strong performance on reasoning benchmarks by concentrating probability on high-likelihood outputs. If true, the implication is that RL training is expensive redundancy: better calibration or smarter decoding would suffice.
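To make that concrete, here is a minimal sketch of what an inference-time sharpening baseline can look like: sample several candidates from the base model, then keep the one the model itself scores as most likely. The model name and prompt are illustrative placeholders, and the procedures sharpening proponents actually cite may score and select candidates differently; this is only meant to show the mechanism, not the paper's baselines.

```python
# Minimal sketch of inference-time "sharpening": sample N candidates from the base
# model, then rerank by the base model's own likelihood (no training involved).
# Model name and prompt are illustrative; real baselines may score candidates differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # one of the model families used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def completion_logprob(prompt_ids: torch.Tensor, full_ids: torch.Tensor) -> float:
    """Sum of log-probabilities the base model assigns to the completion tokens."""
    with torch.no_grad():
        logits = model(full_ids.unsqueeze(0)).logits[0]
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    total = 0.0
    # The token at position t is predicted by the logits at position t - 1.
    for t in range(prompt_ids.shape[0], full_ids.shape[0]):
        total += logprobs[t - 1, full_ids[t]].item()
        if full_ids[t].item() == tok.eos_token_id:  # stop before trailing padding
            break
    return total

prompt = "Solve: 12 * 7 + 5 = ?"
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]

# Sample several completions, then keep the one the model already favored most.
candidates = model.generate(
    prompt_ids.unsqueeze(0),
    do_sample=True,
    temperature=0.8,
    max_new_tokens=128,
    num_return_sequences=8,
    pad_token_id=tok.eos_token_id,
)
best = max(candidates, key=lambda ids: completion_logprob(prompt_ids, ids))
print(tok.decode(best[prompt_ids.shape[0]:], skip_special_tokens=True))
```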
Researchers Sarthak Mittal, Leo Gagnon, and Guillaume Lajoie stress-tested that hypothesis using a single unified framework. They used the standard KL-regularized RL objective — which balances a reward maximization term against a KL divergence penalty relative to a reference distribution — and varied each component's contribution to isolate four training regimes: pure task-reward optimization, distribution sharpening optimization (where the "reward" is simply log-probability under the base model), and two hybrid variants called Tilted Sampling and Tempered Sampling. Because all four methods share the same underlying training procedure, observed differences reflect the signal being optimized rather than artifacts of switching frameworks.
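In conventional notation (the paper's exact formulation and its definitions of Tilted and Tempered Sampling may differ), the shared objective and the sharpening special case look like the following; the well-known closed form of the optimum makes clear why optimizing the base model's own log-probability can only re-weight what the model already does.

```latex
% Standard KL-regularized RL objective (conventional rendering, not necessarily
% the paper's exact notation):
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% Its well-known optimum tilts the reference model by the exponentiated reward:
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left( r(x, y) / \beta \right)

% Sharpening-only regime: set the "reward" to the base model's own log-probability,
% r(x, y) = \log \pi_{\mathrm{ref}}(y \mid x). The optimum collapses to a tempered base model:
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)^{\,1 + 1/\beta}
```

Raising the base distribution to a power can only concentrate mass on outputs the base model already favors; a task reward, by contrast, can shift mass toward outputs the base model scored poorly but that actually solve the problem.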
The verdict against distribution sharpening is both theoretical and empirical. The paper demonstrates from first principles that the optima of sharpening-only objectives are unfavorable and that the optimization is unstable. Experiments across Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, and Qwen3-4B-Instruct on mathematical reasoning datasets confirm the theory: sharpening alone yields limited gains, while incorporating task-based reward signals produces stable performance improvements across all three model families.
For enterprise AI teams, the architectural implication is unambiguous: reward function design is not a secondary concern that can be deferred or approximated. If post-training gains came primarily from sharpening, organizations could trade RL pipelines for cheaper inference-time alternatives such as best-of-N sampling, smarter decoding strategies, or temperature sweeps. This paper closes that escape hatch. The quality of the task reward signal is a first-order input to capability uplift, not a detail to borrow from adjacent work.
The results also reframe how to interpret evaluation jumps after RL fine-tuning. Teams that attributed model improvements to better output formatting or increased confidence in existing reasoning patterns may need to revisit those attributions. The paper's framework provides a controlled diagnostic — running a sharpening-only baseline against a task-reward run on the same architecture and dataset yields a clean separation of the two effects.
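A minimal sketch of that diagnostic, assuming a generic KL-regularized trainer rather than the paper's code: hold the model, dataset, hyperparameters, and KL penalty fixed, and swap only the reward function. The names `rl_finetune` and `base_logprob_fn` are hypothetical placeholders for a team's existing trainer and a frozen copy of the base model's scoring function; neither comes from the paper.

```python
# Sharpening-vs-task-reward ablation: two runs that differ only in the optimized signal.

def task_reward(example: dict, completion: str) -> float:
    """Verifiable binary reward: 1.0 if the reference answer appears in the completion."""
    return 1.0 if example["answer"].strip() in completion else 0.0

def make_sharpening_reward(base_logprob_fn):
    """Sharpening-only 'reward': the frozen base model's own log-probability of the
    sampled completion, so optimization can only concentrate existing preferences."""
    def reward(example: dict, completion: str) -> float:
        return base_logprob_fn(example["prompt"], completion)
    return reward

# Identical architecture, dataset, hyperparameters, and KL penalty in both runs;
# any gap on held-out accuracy then cleanly separates the two effects.
# run_task  = rl_finetune(model, dataset, reward_fn=task_reward)
# run_sharp = rl_finetune(model, dataset, reward_fn=make_sharpening_reward(base_logprob_fn))
```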
Limitations remain. The experiments focus on mathematical reasoning, a domain with cheap, verifiable binary rewards. Tasks where the reward signal is noisy, delayed, or must come from human preference modeling, such as coding agents, multi-step tool use, and open-ended generation, may produce different trade-offs. Whether the instability finding for pure sharpening generalizes to larger parameter counts and longer training horizons has not yet been demonstrated. Those are the right experiments to run next.
The core verdict is straightforward: task rewards are doing real work. Reward engineering deserves the same rigor as architecture selection.