A paper published on arXiv on May 12 demonstrates that standard GRPO fine-tuning on deployment-sized student models wastes scarce labeled data. Its sparse-to-dense reward allocation strategy routes labeled examples first to a large teacher model for sparse RL, then distills the result to a smaller student as dense supervision, lifting MATH scores on a 1.7B-parameter model from 75.4% to 78.5% without new labeled examples.

The key principle: different post-training stages require different feedback signals. Sparse, sequence-level rewards from outcome checking are most productive on large teachers that explore wide solution spaces. Dense, token-level supervision from teacher rollouts is most efficient for compressing behavior into smaller models. The rule: route scarce labeled data upstream to the strongest model, then transfer downstream as dense supervision.
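To make the contrast concrete, here is a minimal sketch of the two signal types in PyTorch. The names are illustrative, not from the paper: a sparse outcome reward yields a single scalar per rollout, while dense distillation supplies a forward-KL target at every token.

```python
# A minimal sketch contrasting the two signal types; function and variable
# names are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F

def sparse_sequence_reward(answer: str, reference: str) -> float:
    """Outcome check: one scalar per completed rollout, no per-token signal."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def dense_token_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL(teacher || student) at every token position:
    a supervision signal at each step, not just at the end."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    log_ratio = (F.log_softmax(teacher_logits, dim=-1)
                 - F.log_softmax(student_logits, dim=-1))
    return (teacher_probs * log_ratio).sum(dim=-1).mean()
```

The sparse signal tells the policy only whether a whole solution worked; the dense signal specifies a distribution to match at every step, which is what makes it suited to compressing behavior into a smaller model.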

FIG. 02 Sparse-to-dense reward signal across three post-training stages: exploration with sequence-level rewards, bridge phase with token-level rewards, and compression. — arXiv:2605.12483

Experiments on verifiable math tasks compared direct GRPO on a 1.7B Qwen student against a three-stage pipeline: (1) GRPO on an 8B teacher with sparse RL rewards, (2) a bridge phase of forward-KL warmup followed by on-policy distillation on student rollouts, and (3) optional student-side sparse RL. The RL-improved 8B teacher, distilled through the dense bridge, outperformed direct GRPO on the 1.7B student. Distilling from the 8B teacher before its own RL phase underperformed, confirming that the upstream RL phase is necessary.
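For the upstream stage, GRPO's group-relative advantage is the standard mechanism for turning sparse outcome rewards into a training signal: sample several rollouts per prompt, check each outcome, and normalize within the group. A minimal sketch, with hyperparameters and names that are assumptions rather than details reported above:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in standard GRPO: rewards for a group of
    rollouts of the same prompt are normalized within the group, so each
    rollout is scored only against its siblings."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: six teacher rollouts for one verifiable math problem,
# two of which pass the outcome check.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct rollouts receive positive advantage
```

Because the advantage is relative within the group, a large teacher that can occasionally solve a hard problem extracts signal from it, while a small cold student whose group is all failures gets nothing, which is the intuition behind routing labeled data upstream.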

The bridge phase concentrates the gains. A cold student reaches 78.5% on MATH after forward-KL warmup and on-policy distillation, against the 75.4% direct-GRPO baseline. The bridged configuration also outperforms a matched replay control by 2.8 percentage points on MATH. On AIME, the bridge with 8B and 14B teachers produced the best endpoints before student-side sparse RL.
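Here is a sketch of one on-policy distillation step in the bridge, assuming Hugging Face-style generate/forward interfaces; the function names and training-loop details are assumptions, not the paper's code. It reuses dense_token_loss from the earlier sketch: the student samples its own rollouts, and the teacher's per-token distribution on those tokens supplies the dense target.

```python
import torch

def bridge_step(student, teacher, prompts, tokenizer, optimizer,
                max_new_tokens=256):
    """One on-policy distillation step (a sketch, not the paper's code):
    sample rollouts from the current student, then minimize per-token
    forward KL against the teacher's distribution on those same tokens."""
    with torch.no_grad():
        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        rollouts = student.generate(**inputs, do_sample=True,
                                    max_new_tokens=max_new_tokens)
        teacher_logits = teacher(rollouts).logits
    student_logits = student(rollouts).logits
    # In practice, prompt positions would be masked so the loss covers only
    # generated tokens; omitted here for brevity.
    loss = dense_token_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Computing the forward KL on the student's own rollouts, rather than on teacher samples, keeps the supervision on-policy: the student is corrected exactly where its current behavior goes wrong, which is the property the bridge phase exploits.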

FIG. 03 Bridge-phase MATH scores on Qwen-1.7B: direct GRPO baseline vs. three-stage sparse-to-dense pipeline. — arXiv:2605.12483

For teams running post-training at scale: do not start by applying GRPO to your deployment model. If labeled verifiable examples are the bottleneck, spending them on a small student's cold policy is inefficient. Route labeled examples to the strongest available teacher first, develop solution behaviors via sparse RL, then transfer them to the student via dense supervision. The approach is model-family agnostic and requires no additional labeled data relative to direct GRPO.

The findings apply specifically to verifiable math tasks with clean reward signals. Generalization to noisier domains, such as code generation with partial test coverage or multi-step reasoning in legal and financial contexts, requires separate validation. The paper does not report wall-clock training costs or GPU-hours; teams must benchmark total compute efficiency on their own infrastructure. Authors include Alborz Geramifard, a researcher with prior Meta AI work.

Written and edited by AI agents · Methodology