Researchers Chu-Cheng Lin and Eugene Ie have published a loss family built on the Tsallis q-logarithm that directly addresses the cold-start stalling problem in reinforcement learning from verifiable rewards (RLVR) fine-tuning: the regime in which a model's initial success rate on a new task is near zero, common when adapting frontier reasoning models to specialized domains.
The paper "How Fast Should a Model Commit to Supervision?" defines a parameterized loss J_Q controlled by a single dial, q. At q=0, the loss is pure RLVR, the exploitation mode most teams use with algorithms like GRPO. At q=1, it becomes log-marginal-likelihood estimation over latent reasoning trajectories. The mechanism is a scalar amplification factor P_θ^{-q} that reweights each training example independently of the learning rate, requiring no changes to model architecture or inference.
Under a gradient-flow analysis, pure RLVR (q=0) requires Ω(1/p₀) iterations to escape cold start when the initial per-problem success probability p₀ is small. The density-estimation pole (q=1) cuts that to Θ(log(1/p₀)), a log-versus-linear gap that becomes decisive when p₀ is 0.01 on a new domain. Intermediate values of q trade escape speed against noise memorization.
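The gap is easy to reproduce in a one-parameter toy model. The sketch below is an illustration of the scaling argument, not the paper's analysis: the sigmoid parameterization, step size, and escape threshold are arbitrary choices of mine, and only the p^{-q} gradient weighting comes from the loss family.

```python
import math

def steps_to_escape(p0: float, q: float, target: float = 0.5,
                    lr: float = 0.05, max_steps: int = 1_000_000) -> int:
    """Toy model: one logit theta with success probability p = sigmoid(theta).
    Each step ascends the q-interpolated objective, whose gradient is
    p^{-q} * dp/dtheta, so q=0 mimics pure RLVR and q=1 mimics log-likelihood."""
    theta = math.log(p0 / (1.0 - p0))          # initialize so that p = p0
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return step
        theta += lr * (p ** -q) * p * (1.0 - p)
    return max_steps

for p0 in (1e-2, 1e-3):
    print(f"p0={p0:g}  q=0: {steps_to_escape(p0, 0.0)}  q=1: {steps_to_escape(p0, 1.0)}")
```

With p₀ = 10⁻³ and these settings, the q=0 run takes roughly twenty thousand steps to reach p = 0.5, while the q=1 run escapes in a few hundred at most, mirroring the linear-versus-logarithmic separation.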
The paper derives two Monte Carlo estimators for the intractable success probability P_θ that drives the amplification factor. Gradient-Amplified RL (GARL) samples from the prior and upweights the standard RL gradient. Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs supervised fine-tuning on the resampled trajectories. GARL carries lower variance; PAFT produces more stable gradients during training. Both estimators share a bias of O(q / (M · P_θ^{q+1})), where M is the number of sampled trajectories.
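As a rough sketch of how a GARL-style update could sit inside an existing group-sampling loop, the function below estimates P_θ for one problem from the same M rollouts a GRPO-style trainer already draws and scales the group's REINFORCE surrogate by that estimate raised to -q. The tensor layout, the clamp floor, and the omission of any advantage baseline are simplifications of mine, not details from the paper; a PAFT-style step would instead resample the verifier-passing rollouts by their importance weights and apply a supervised fine-tuning loss to them.

```python
import torch

def garl_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              q: float, eps: float = 1e-4) -> torch.Tensor:
    """Gradient-amplified RL surrogate for one problem's group of M rollouts.

    logprobs: (M,) summed token log-probabilities of each sampled trajectory
    rewards:  (M,) binary verifier outcomes for the same trajectories
    """
    # Monte Carlo estimate of the problem's current success probability P_theta.
    p_hat = rewards.float().mean().clamp_min(eps)
    # Standard REINFORCE-style surrogate: reward-weighted negative log-likelihood.
    rl_loss = -(rewards.float() * logprobs).mean()
    # Amplify the whole group's gradient by P_hat^{-q}. The weight is detached,
    # so at q=0 this reduces exactly to the plain RLVR surrogate.
    return (p_hat.detach() ** (-q)) * rl_loss
```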
Benchmarks on FinQA, HotPotQA, and MuSiQue validated the theory. In cold-start conditions, GARL at q=0.75 escaped stalling on tasks where GRPO failed entirely. In warm-start conditions, GARL at low q dominated on FinQA, while on HotPotQA and MuSiQue higher q destabilized training; switching to PAFT at q=0.75 recovered stability and achieved 47.9 maj@16 on HotPotQA, a gain of 14.4 points over GRPO.
For enterprise teams, the implication is direct. RLVR fine-tuning has become the default for extracting reasoning from models like DeepSeek-R1 derivatives without full retraining, but it fails silently when out-of-the-box accuracy on the target task is low, a common situation in legal, scientific, or financial verticals with specialized vocabulary and multi-hop reasoning. This work provides a drop-in upgrade: use J_Q with q slightly below 1 during cold start, then anneal toward q=0 as the model gains traction, switching to PAFT when gradient stability matters.
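The paper prescribes no schedule for this (see the open questions below), so the fragment that follows is only one hypothetical way a team might encode the recipe; every threshold, q value, and the rule for switching estimators is an illustrative placeholder, not a recommendation from the authors.

```python
def choose_q_and_estimator(success_rate: float) -> tuple[float, str]:
    """Hypothetical q schedule driven by the measured per-problem success rate."""
    if success_rate < 0.05:
        return 0.9, "PAFT"   # cold start: stay near the density-estimation pole,
                             # preferring PAFT if high-q gradients destabilize
    if success_rate < 0.30:
        return 0.5, "GARL"   # partial traction: interpolate
    return 0.0, "GARL"       # warm start: plain RLVR / GRPO-style updates
```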
Open questions remain on q-scheduling. The paper does not provide an automated method for selecting or annealing q during training. Scaling behavior across model sizes is uncharacterized. The gradient-flow analysis gives teams a diagnostic: measure p₀ on your target task before committing to a training recipe, and let that number drive q selection rather than defaulting to pure RLVR.
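Measuring p₀ requires nothing beyond the rollout and verification machinery an RLVR pipeline already has. In the sketch below, sample_trajectory and verify are stand-ins for whatever generation and checking functions a team already uses, and 16 rollouts per problem is an arbitrary default.

```python
def estimate_p0(problems, sample_trajectory, verify, rollouts: int = 16) -> float:
    """Estimate the cold-start success probability p0 by sampling rollouts from
    the untuned model on the target task and scoring them with the verifier."""
    successes = total = 0
    for problem in problems:
        for _ in range(rollouts):
            successes += int(verify(problem, sample_trajectory(problem)))
            total += 1
    return successes / max(total, 1)
```

An estimate on the order of 10⁻² or lower is the regime where, by the analysis above, pure RLVR needs roughly 1/p₀ iterations just to leave the stall and starting with q near 1 is most likely to pay off.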