Researchers Chu-Cheng Lin and Eugene Ie have published a loss family built on the Tsallis q-logarithm that directly addresses the cold-start stalling problem in reinforcement learning from verifiable rewards (RLVR) fine-tuning: the regime in which a model's initial success rate on a new task is near zero, common when adapting frontier reasoning models to specialized domains.
The paper "How Fast Should a Model Commit to Supervision?" defines a parameterized loss J_Q controlled by a single dial, q. At q=0, the loss is pure RLVR, the exploitation mode most teams use with algorithms like GRPO. At q=1, it becomes log-marginal-likelihood estimation over latent reasoning trajectories. The mechanism is a scalar amplification factor P_θ^{-q} that reweights each training example independently of the learning rate, requiring no changes to model architecture or inference.
Under a gradient-flow analysis, pure RLVR (q=0) requires Ω(1/p₀) iterations to escape cold start when the initial per-problem success probability p₀ is small. The density-estimation pole (q=1) cuts that to Θ(log(1/p₀)), a log-versus-linear gap that becomes decisive when p₀ is 0.01 on a new domain. Intermediate values of q trade escape speed against noise memorization.
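The gap is easy to reproduce in a one-parameter toy model. The sketch below is an illustration of the scaling argument, not the paper's analysis: the sigmoid parameterization, step size, and escape threshold are arbitrary choices of mine, and only the p^{-q} gradient weighting comes from the loss family.

```python
import math

def steps_to_escape(p0: float, q: float, target: float = 0.5,
                    lr: float = 0.05, max_steps: int = 1_000_000) -> int:
    """Toy model: one logit theta with success probability p = sigmoid(theta).
    Each step ascends the q-interpolated objective, whose gradient is
    p^{-q} * dp/dtheta, so q=0 mimics pure RLVR and q=1 mimics log-likelihood."""
    theta = math.log(p0 / (1.0 - p0))          # initialize so that p = p0
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return step
        theta += lr * (p ** -q) * p * (1.0 - p)
    return max_steps

for p0 in (1e-2, 1e-3):
    print(f"p0={p0:g}  q=0: {steps_to_escape(p0, 0.0)}  q=1: {steps_to_escape(p0, 1.0)}")
```

With p₀ = 10⁻³ and these settings, the q=0 run takes roughly twenty thousand steps to reach p = 0.5, while the q=1 run escapes in a few hundred at most, mirroring the linear-versus-logarithmic separation.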
The paper derives two Monte Carlo estimators for the intractable success probability P_θ that drives the amplification factor. Gradient-Amplified RL (GARL) samples from the prior and upweights the standard RL gradient. Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs supervised fine-tuning on the resampled trajectories. GARL carries lower variance; PAFT produces more stable gradients during training. Both estimators share a bias of O(q / (M · P_θ^{q+1})), where M is the number of sampled trajectories.
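As a rough sketch of how a GARL-style update could sit inside an existing group-sampling loop, the function below estimates P_θ for one problem from the same M rollouts a GRPO-style trainer already draws and scales the group's REINFORCE surrogate by that estimate raised to -q. The tensor layout, the clamp floor, and the omission of any advantage baseline are simplifications of mine, not details from the paper; a PAFT-style step would instead resample the verifier-passing rollouts by their importance weights and apply a supervised fine-tuning loss to them.

```python
import torch

def garl_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              q: float, eps: float = 1e-4) -> torch.Tensor:
    """Gradient-amplified RL surrogate for one problem's group of M rollouts.

    logprobs: (M,) summed token log-probabilities of each sampled trajectory
    rewards:  (M,) binary verifier outcomes for the same trajectories
    """
    # Monte Carlo estimate of the problem's current success probability P_theta.
    p_hat = rewards.float().mean().clamp_min(eps)
    # Standard REINFORCE-style surrogate: reward-weighted negative log-likelihood.
    rl_loss = -(rewards.float() * logprobs).mean()
    # Amplify the whole group's gradient by P_hat^{-q}. The weight is detached,
    # so at q=0 this reduces exactly to the plain RLVR surrogate.
    return (p_hat.detach() ** (-q)) * rl_loss
```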
Benchmarks on FinQA, HotPotQA, and MuSiQue validated the theory. In cold-start conditions, GARL at q=0.75 escaped stalling on tasks where GRPO failed entirely. In warm-start conditions, GARL at low q dominated on FinQA, while on HotPotQA and MuSiQue higher q destabilized training; switching to PAFT at q=0.75 recovered stability and achieved 47.9 maj@16 on HotPotQA, a gain of 14.4 points over GRPO.
For enterprise teams, the implication is direct. RLVR fine-tuning has become the default for extracting reasoning from models like DeepSeek-R1 derivatives without full retraining, but it fails silently when out-of-the-box accuracy on the target task is low, a common situation in legal, scientific, or financial verticals with specialized vocabulary and multi-hop reasoning. This work provides a drop-in upgrade: use J_Q with q slightly below 1 during cold start, then anneal toward q=0 as the model gains traction, switching to PAFT when gradient stability matters.
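The paper prescribes no schedule for this (see the open questions below), so the fragment that follows is only one hypothetical way a team might encode the recipe; every threshold, q value, and the rule for switching estimators is an illustrative placeholder, not a recommendation from the authors.

```python
def choose_q_and_estimator(success_rate: float) -> tuple[float, str]:
    """Hypothetical q schedule driven by the measured per-problem success rate."""
    if success_rate < 0.05:
        return 0.9, "PAFT"   # cold start: stay near the density-estimation pole,
                             # preferring PAFT if high-q gradients destabilize
    if success_rate < 0.30:
        return 0.5, "GARL"   # partial traction: interpolate
    return 0.0, "GARL"       # warm start: plain RLVR / GRPO-style updates
```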
Open questions remain on q-scheduling. The paper does not provide an automated method for selecting or annealing q during training. Scaling behavior across model sizes is uncharacterized. The gradient-flow analysis gives teams a diagnostic: measure p₀ on your target task before committing to a training recipe, and let that number drive q selection rather than defaulting to pure RLVR.
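Measuring p₀ requires nothing beyond the rollout and verification machinery an RLVR pipeline already has. In the sketch below, sample_trajectory and verify are stand-ins for whatever generation and checking functions a team already uses, and 16 rollouts per problem is an arbitrary default.

```python
def estimate_p0(problems, sample_trajectory, verify, rollouts: int = 16) -> float:
    """Estimate the cold-start success probability p0 by sampling rollouts from
    the untuned model on the target task and scoring them with the verifier."""
    successes = total = 0
    for problem in problems:
        for _ in range(rollouts):
            successes += int(verify(problem, sample_trajectory(problem)))
            total += 1
    return successes / max(total, 1)
```

An estimate on the order of 10⁻² or lower is the regime where, by the analysis above, pure RLVR needs roughly 1/p₀ iterations just to leave the stall and starting with q near 1 is most likely to pay off.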