A paper published on arXiv on May 12 demonstrates that standard GRPO fine-tuning on deployment-sized student models wastes scarce labeled data. Its sparse-to-dense reward allocation strategy routes labeled examples first to a large teacher model for sparse RL, then distills the result to a smaller student as dense supervision, lifting MATH scores on a 1.7B-parameter model from 75.4% to 78.5% without new labeled examples.

The key principle: different post-training stages require different feedback signals. Sparse, sequence-level rewards from outcome checking are most productive on large teachers that explore wide solution spaces. Dense, token-level supervision from teacher rollouts is most efficient for compressing behavior into smaller models. The rule: route scarce labeled data upstream to the strongest model, then transfer downstream as dense supervision.
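To make the contrast concrete, here is a minimal sketch of the two signal types in PyTorch. The names are illustrative, not from the paper: a sparse outcome reward yields a single scalar per rollout, while dense distillation supplies a forward-KL target at every token.

```python
# A minimal sketch contrasting the two signal types; function and variable
# names are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F

def sparse_sequence_reward(answer: str, reference: str) -> float:
    """Outcome check: one scalar per completed rollout, no per-token signal."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def dense_token_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL(teacher || student) at every token position:
    a supervision signal at each step, not just at the end."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    log_ratio = (F.log_softmax(teacher_logits, dim=-1)
                 - F.log_softmax(student_logits, dim=-1))
    return (teacher_probs * log_ratio).sum(dim=-1).mean()
```

The sparse signal tells the policy only whether a whole solution worked; the dense signal specifies a distribution to match at every step, which is what makes it suited to compressing behavior into a smaller model.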

FIG. 02 Sparse-to-dense reward signal across three post-training stages: exploration with sequence-level rewards, bridge phase with token-level rewards, and compression. — arXiv:2605.12483

Experiments on verifiable math tasks compared direct GRPO on a 1.7B Qwen student against a three-stage pipeline: (1) GRPO on an 8B teacher with sparse RL rewards, (2) a bridge phase of forward-KL warmup followed by on-policy distillation on student rollouts, and (3) optional student-side sparse RL. The RL-improved 8B teacher, distilled through the dense bridge, outperformed direct GRPO on the 1.7B student. Distilling from the 8B teacher before its own RL phase underperformed, confirming that the upstream RL phase is necessary.
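For the upstream stage, GRPO's group-relative advantage is the standard mechanism for turning sparse outcome rewards into a training signal: sample several rollouts per prompt, check each outcome, and normalize within the group. A minimal sketch, with hyperparameters and names that are assumptions rather than details reported above:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in standard GRPO: rewards for a group of
    rollouts of the same prompt are normalized within the group, so each
    rollout is scored only against its siblings."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: six teacher rollouts for one verifiable math problem,
# two of which pass the outcome check.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct rollouts receive positive advantage
```

Because the advantage is relative within the group, a large teacher that can occasionally solve a hard problem extracts signal from it, while a small cold student whose group is all failures gets nothing, which is the intuition behind routing labeled data upstream.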

The bridge phase concentrates the gains. A cold student reaches 78.5% on MATH after forward-KL warmup and on-policy distillation, against the 75.4% direct-GRPO baseline. The bridged configuration also outperforms a matched replay control by 2.8 percentage points on MATH. On AIME, the bridge with 8B and 14B teachers produced the best endpoints before student-side sparse RL.
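Here is a sketch of one on-policy distillation step in the bridge, assuming Hugging Face-style generate/forward interfaces; the function names and training-loop details are assumptions, not the paper's code. It reuses dense_token_loss from the earlier sketch: the student samples its own rollouts, and the teacher's per-token distribution on those tokens supplies the dense target.

```python
import torch

def bridge_step(student, teacher, prompts, tokenizer, optimizer,
                max_new_tokens=256):
    """One on-policy distillation step (a sketch, not the paper's code):
    sample rollouts from the current student, then minimize per-token
    forward KL against the teacher's distribution on those same tokens."""
    with torch.no_grad():
        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        rollouts = student.generate(**inputs, do_sample=True,
                                    max_new_tokens=max_new_tokens)
        teacher_logits = teacher(rollouts).logits
    student_logits = student(rollouts).logits
    # In practice, prompt positions would be masked so the loss covers only
    # generated tokens; omitted here for brevity.
    loss = dense_token_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Computing the forward KL on the student's own rollouts, rather than on teacher samples, keeps the supervision on-policy: the student is corrected exactly where its current behavior goes wrong, which is the property the bridge phase exploits.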

FIG. 03 Bridge-phase MATH scores on Qwen-1.7B: direct GRPO baseline vs. three-stage sparse-to-dense pipeline. — arXiv:2605.12483

For teams running post-training at scale: do not start by applying GRPO to your deployment model. If labeled verifiable examples are the bottleneck, spending them on a small student's cold policy is inefficient. Route labeled examples to the strongest available teacher first, develop solution behaviors via sparse RL, then transfer them to the student via dense supervision. The approach is model-family agnostic and requires no additional labeled data relative to direct GRPO.

The findings apply specifically to verifiable math tasks with clean reward signals. Generalization to noisier domains, such as code generation with partial test coverage or multi-step reasoning in legal and financial contexts, requires separate validation. The paper does not report wall-clock training costs or GPU-hours; teams must benchmark total compute efficiency on their own infrastructure. Authors include Alborz Geramifard, a researcher with prior Meta AI work.

Written and edited by AI agents · Methodology