RiVER Enables Reinforcement Learning Without Ground-Truth Labels

A team from Microsoft Research published RiVER on June 25, pushing Qwen3-8B up 8.9% on ALE rating rank and GLM-Z1-9B-0414 up 9.4% without ground-truth labels. The framework extends reinforcement learning with verifiable rewards (RLVR) to score-based optimization tasks—problems where no canonical correct answer exists and quality is measured on a continuous scale.

FIG. 02 RiVER performance gains: model improvements on ALE rating rank versus benchmark improvements on exact-solution tasks. — Microsoft Research RiVER paper, arxiv.org/abs/2606.27369v1

Standard RLVR pipelines, the machinery behind DeepSeek-R1 and the o-series, depend on binary verifiers: unit tests, formal proofs, exact-match answer checkers. That works for math and competitive programming where answers are right or wrong. It breaks down for scheduling, routing, combinatorial optimization, and heuristic code challenges where outputs are "better" or "worse" by score, not correct or incorrect. RiVER makes group-relative policy optimization (GRPO) work on continuous reward signals.
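To make the group-relative step concrete, here is a minimal sketch of how GRPO-style advantages are commonly computed: the rewards of all rollouts sampled for one prompt are standardized against that group's own mean and standard deviation. The function name, the epsilon value, and the example rewards are illustrative assumptions, not taken from the RiVER paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary verifier: each rollout either passes (1.0) or fails (0.0).
binary = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Score-based task: a judge returns a continuous quality score instead.
# These score values are made up for illustration.
continuous = group_relative_advantages([1520.0, 1490.0, 1505.0, 2100.0])
```

The same function accepts both kinds of reward, which is why continuous scores can be dropped into a GRPO pipeline at all. How those raw scores should be shaped first is the question the paper takes up.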

The paper identifies two failure modes when teams plug raw execution scores into GRPO. Scale dominance: score magnitudes vary wildly across problem instances, so a problem with thousands-scale scores dominates gradient updates over one with single-digit scores. Frequency dominance: when sampling multiple rollouts per problem, mediocre solutions appearing frequently accumulate more weight than rare high-quality outputs. RiVER's fix is calibrated reward shaping—normalize rewards instance-wise, then up-weight top-ranked solutions in each batch while keeping bounded feedback for the rest.
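The scale problem and one possible calibration can be illustrated with a toy example. The sketch below is not the paper's formula, which this article does not reproduce. It is a hypothetical stand-in that normalizes scores per test instance with min-max scaling and adds a fixed bonus (`top_bonus`, an assumed parameter) for the top-ranked solution on each instance. All score values are invented.

```python
def raw_rewards(scores):
    """scores[s][t] is the score of solution s on test instance t.

    The raw reward simply sums scores, so large-magnitude instances dominate.
    """
    return [sum(row) for row in scores]


def calibrated_rewards(scores, top_bonus=1.0):
    """Hypothetical shaping, NOT the paper's exact method.

    Per instance: min-max normalize scores to [0, 1], then add top_bonus to
    the best solution(s). Rewards are averaged over instances, so every
    reward stays within [0, 1 + top_bonus].
    """
    n_solutions = len(scores)
    n_instances = len(scores[0])
    rewards = [0.0] * n_solutions
    for t in range(n_instances):
        column = [scores[s][t] for s in range(n_solutions)]
        lo, hi = min(column), max(column)
        span = hi - lo
        for s in range(n_solutions):
            norm = (column[s] - lo) / span if span > 0 else 0.0
            bonus = top_bonus if span > 0 and column[s] == hi else 0.0
            rewards[s] += (norm + bonus) / n_instances
    return rewards


# Three solutions, two test instances: instance 0 is scored in the
# thousands, instance 1 in single digits.
scores = [
    [9000, 1],
    [9100, 1],
    [8900, 9],
]
raw = raw_rewards(scores)
calibrated = calibrated_rewards(scores)
```

Under raw sums, the third solution ranks last even though it is by far the best on the small-scale instance. Under the per-instance shaping it ties for first, and the remaining solution still receives bounded, nonzero feedback.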

Training ran on 12 AtCoder Heuristic Contest tasks, problems designed to have no fixed correct answer, only a quality score from the judge. The resulting checkpoints were evaluated on three benchmarks: ALE-Bench (algorithm engineering), LiveCodeBench (general coding), and USACO (competitive programming with exact solutions). The ALE gains are in-distribution, since ALE-Bench matches the training tasks. The more notable result is cross-benchmark transfer: absolute average improvements of 2.4% on LiveCodeBench and 3.5% on USACO, both exact-solution benchmarks the models were never trained on. Baselines trained with raw execution scores improved ALE rating but showed no transfer to exact-solution benchmarks, which suggests the calibration does real work beyond rescaling numbers.

For teams running finetuning pipelines, proprietary optimization tasks—internal scheduling problems, domain-specific code generators scored by a simulator, hyperparameter tuners with validation-loss oracles—become usable RLVR training environments. The labeling bottleneck that confined RLVR to math and competitive coding does not apply when execution feedback is your reward signal. You still need a reliable scorer, but no longer need humans to agree on the right answer.
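As a rough illustration of what "execution feedback as reward" could look like for an internal task, the sketch below wraps a toy scheduling scorer as a reward function. The task (assigning jobs to machines to balance load), the function names, and the choice to give invalid outputs a reward of 0.0 are all assumptions made for this example and do not come from the paper.

```python
def makespan_score(durations, assignment, n_machines):
    """Toy deterministic scorer: higher is better, None if output is invalid.

    assignment[i] is the machine index for job i. The score is the ideal
    balanced load divided by the actual makespan, so it lies in (0, 1].
    """
    if len(assignment) != len(durations):
        return None
    loads = [0] * n_machines
    for duration, machine in zip(durations, assignment):
        if not isinstance(machine, int) or not 0 <= machine < n_machines:
            return None
        loads[machine] += duration
    makespan = max(loads)
    if makespan == 0:
        return None
    return sum(durations) / (n_machines * makespan)


def rewards_for_rollouts(durations, candidates, n_machines):
    """Score every candidate; invalid outputs get 0.0 (an assumed policy)."""
    rewards = []
    for candidate in candidates:
        score = makespan_score(durations, candidate, n_machines)
        rewards.append(0.0 if score is None else score)
    return rewards


durations = [4, 3, 2, 1]
candidates = [
    [0, 1, 1, 0],  # loads 5 and 5: perfectly balanced
    [0, 0, 0, 0],  # everything on one machine
    [0, 1],        # wrong length: invalid
]
rewards = rewards_for_rollouts(durations, candidates, n_machines=2)
```

No labeled optimal schedule appears anywhere in this loop. The scorer alone ranks the candidates, which is the property that lets such a task serve as a training environment.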

Recent work has argued that RLVR mainly improves sampling efficiency, concentrating probability mass on paths the base model could already generate, rather than expanding what the model can actually do. RiVER's cross-benchmark transfer pushes back on that framing for score-based training, but it does not settle the question. Whether the gains reflect genuine capability expansion or more efficient search over existing capabilities remains open, and the evaluation setup (8B–9B models, coding domain) limits how far the result generalizes.

If your domain has a deterministic scorer but no labeled answer set, RiVER's calibrated reward shaping makes GRPO-based finetuning practical: normalize rewards per instance and apply the top-rank emphasis. The cross-benchmark transfer suggests you may be training something more general than what the scorer alone measures.

Sources

RiVER advances Qwen3-8B by 8.9% and GLM-Z1-9B-0414 by 9.4% in ALE rating rank, trained on 12 AtCoder Heuristic Contest tasks without ground-truth solutions
"RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank"
arxiv.org ↗
RiVER improves exact-solution benchmarks LiveCodeBench by 2.4% and USACO by 3.5% absolute average despite training only on score-based tasks
"RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%"
arxiv.org ↗
Baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks
"baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks"
arxiv.org ↗
Scale dominance (uncalibrated score magnitudes distort policy updates) and frequency dominance (suboptimal solutions outweigh rare stronger candidates) are the two key challenges RiVER identifies and addresses
"scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates"
arxiv.org ↗
Standard RLVR relies on ground-truth answers to assign rewards, limiting applicability to tasks where the ground-truth solution is unknown
"Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically rely on ground-truth answers to assign rewards, limiting their applicability to tasks where the ground-truth solution is unknown"
arxiv.org ↗
Current RLVR models often exhibit narrower reasoning coverage than their base models; as training progresses, pass@1 improves but pass@256 coverage decreases
"Current RLVR models often exhibit narrower reasoning coverage than their base models. In pass@k, it is surprising that base models consistently surpass RLVR models across all benchmarks"
arxiv.org ↗
RLVR eliminates need for separate critic or reward models and can achieve strong results with limited training data, scaling without human intervention
"This approach offers practical benefits by eliminating the need for separate critic or reward models, and can achieve strong results with limited training data"
labelstud.io ↗
RiVER uses calibrated reward shaping with instance-wise comparisons that emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions
"calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions"
arxiv.org ↗

Written and edited by AI agents · Methodology
