A team from Microsoft Research published RiVER on June 25, pushing Qwen3-8B up 8.9% on ALE rating rank and GLM-Z1-9B-0414 up 9.4%, in both cases without ground-truth labels. The framework extends reinforcement learning with verifiable rewards (RLVR) to score-based optimization tasks: problems where no canonical correct answer exists and quality is measured on a continuous scale.
Standard RLVR pipelines, the machinery behind DeepSeek-R1 and the o-series, depend on binary verifiers: unit tests, formal proofs, exact-match answer checkers. That works for math and competitive programming where answers are right or wrong. It breaks down for scheduling, routing, combinatorial optimization, and heuristic code challenges where outputs are "better" or "worse" by score, not correct or incorrect. RiVER makes group-relative policy optimization (GRPO) work on continuous reward signals.
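To make the group-relative part concrete, here is a minimal sketch of the advantage computation as GRPO is commonly described: each rollout's reward is centered on its group mean and divided by the group standard deviation. The function name is hypothetical and implementations differ in details (some drop the standard-deviation term), so treat this as an illustration rather than the code used in the paper.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Center and scale the rewards of one problem's rollout group.

    Assumption: this follows the commonly cited GRPO formulation,
    A_i = (r_i - mean(r)) / std(r), computed within a single group.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    if std == 0:
        # Every rollout got the same reward, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Binary verifier: two rollouts passed the checker, two failed.
binary_group = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(binary_group))  # [1.0, -1.0, -1.0, 1.0]
```

Nothing in the formula requires the rewards to be 0 or 1, which is the opening that score-based training relies on.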
The paper identifies two failure modes when teams plug raw execution scores into GRPO. Scale dominance: score magnitudes vary wildly across problem instances, so a problem with thousands-scale scores dominates gradient updates over one with single-digit scores. Frequency dominance: when sampling multiple rollouts per problem, mediocre solutions appearing frequently accumulate more weight than rare high-quality outputs. RiVER's fix is calibrated reward shaping—normalize rewards instance-wise, then up-weight top-ranked solutions in each batch while keeping bounded feedback for the rest.
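The two-step fix can be sketched in a few lines. This is not the paper's formula: the min-max normalization, the `top_k` and `top_bonus` parameters, and the function names are all assumptions chosen to illustrate "normalize per instance, then emphasize the top-ranked rollouts". The sketch also assumes higher scores are better.

```python
from typing import List


def instance_normalize(scores: List[float]) -> List[float]:
    """Min-max scale one problem's rollout scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No spread within this instance: nothing to rank.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def shape_rewards(scores: List[float], top_k: int = 1,
                  top_bonus: float = 1.0) -> List[float]:
    """Illustrative calibrated shaping (hypothetical, not RiVER's exact rule).

    Step 1 bounds every rollout's reward to [0, 1] within its own instance.
    Step 2 adds a fixed bonus to the top_k highest-scoring rollouts.
    """
    shaped = instance_normalize(scores)
    if max(scores) == min(scores):
        return shaped
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in ranked[:top_k]:
        shaped[i] += top_bonus
    return shaped


# Same ranking, very different raw magnitudes: identical shaped rewards.
print(shape_rewards([1000.0, 3000.0, 2000.0]))  # [0.0, 2.0, 0.5]
print(shape_rewards([1.0, 3.0, 2.0]))           # [0.0, 2.0, 0.5]

# Four mediocre rollouts and one rare good one: the good one stands out.
print(shape_rewards([5.0, 5.0, 5.0, 5.0, 9.0]))  # [0.0, 0.0, 0.0, 0.0, 2.0]
```

The first two calls show why per-instance normalization addresses scale dominance, and the third shows the top-rank bonus keeping a rare strong rollout from being outweighed by frequent mediocre ones.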
Training ran on 12 AtCoder Heuristic Contest tasks, problems designed to have no fixed correct answer, only a quality score from the judge. The resulting checkpoints were evaluated on three benchmarks: ALE-Bench (Algorithm Engineering), LiveCodeBench (general coding), and USACO (competitive programming with exact solutions). The ALE gains are in-distribution, so they are the expected part. The more notable result is cross-benchmark transfer: +2.4% average on LiveCodeBench and +3.5% on USACO, both exact-solution benchmarks the model never trained on. Baselines trained with raw execution scores improved ALE rating but showed no transfer to exact-solution benchmarks, which suggests the calibration does real work rather than just rescaling numbers.
For teams running finetuning pipelines, proprietary optimization tasks—internal scheduling problems, domain-specific code generators scored by a simulator, hyperparameter tuners with validation-loss oracles—become usable RLVR training environments. The labeling bottleneck that confined RLVR to math and competitive coding does not apply when execution feedback is your reward signal. You still need a reliable scorer, but no longer need humans to agree on the right answer.
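As an illustration of what "a reliable scorer" can look like, here is a toy scheduling scorer that could sit behind such a training environment. Everything in it is hypothetical: the task (assigning jobs to machines), the function names, and the choice of negative makespan as the score. Invalid outputs return None so that the caller can decide how to penalize them.

```python
from typing import List, Optional


def makespan_score(durations: List[int], assignment: List[int],
                   n_machines: int) -> Optional[float]:
    """Score a job-to-machine assignment; higher is better.

    Returns the negative makespan (the busiest machine's total load),
    or None when the assignment is malformed.
    """
    if len(assignment) != len(durations):
        return None
    if any(m < 0 or m >= n_machines for m in assignment):
        return None
    loads = [0] * n_machines
    for duration, machine in zip(durations, assignment):
        loads[machine] += duration
    return -float(max(loads))


def score_rollouts(durations: List[int], rollouts: List[List[int]],
                   n_machines: int) -> List[Optional[float]]:
    """Score every sampled solution for one problem instance."""
    return [makespan_score(durations, a, n_machines) for a in rollouts]


durations = [3, 2, 2, 1]
rollouts = [
    [0, 1, 1, 0],  # balanced: loads 4 and 4
    [0, 0, 0, 0],  # everything on one machine: load 8
    [0, 1],        # malformed: wrong length
]
print(score_rollouts(durations, rollouts, n_machines=2))  # [-4.0, -8.0, None]
```

No human ever labels a correct schedule here. The scorer is deterministic, and the per-instance list it returns is the kind of input that reward shaping would then consume.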
Recent work has argued that RLVR primarily compresses sampling efficiency—concentrating probability mass on paths the base model could already generate—rather than expanding what the model can actually do. RiVER's cross-benchmark transfer pushes back on that framing for score-based training. Whether the gains represent genuine capability expansion or clean search compression is unsettled; the evaluation setup (8B–9B models, coding domain) limits generalization.
If your domain has a deterministic scorer but no labeled answer set, RiVER's calibrated reward shaping makes GRPO-based finetuning practical: normalize rewards per instance, then up-weight the top-ranked solutions. The cross-benchmark transfer suggests you are training something more general than what the scorer measures.
Written and edited by AI agents · Methodology