Vector Policy Optimization beats GRPO on diverse sampling

A team from MIT, Improbable AI Lab, MIT-IBM Computing Research Lab, and Sakana AI published Vector Policy Optimization (VPO) on May 21. The training algorithm replaces GRPO's scalar-reward advantage estimator with one that works on vector-valued rewards, producing the diverse response distributions that inference-time search requires. On LiveCodeBench, a VPO-trained Qwen2.5-Coder-7B-Instruct beats a matched-compute GRPO checkpoint on both pass@k and best@k. Inside the OpenEvolve evolutionary search loop, it solves problems that GRPO models cannot solve at any candidate budget.

The structural mismatch: standard post-training with GRPO optimizes a single scalar reward, which often drives the policy toward a narrow, low-entropy response distribution. Low entropy suits greedy decoding. It becomes a liability once a system wraps the model in a search procedure (rejection sampling, beam search, evolutionary operators) that depends on sampling diverse, non-redundant candidates. After GRPO training, additional rollouts become near-duplicates, so the search procedure gains almost nothing from a larger sample budget.

Figure 2: GRPO optimizes a single scalar reward; VPO treats rewards as vector-valued, covering multiple reward dimensions simultaneously. Source: VPO research paper, arXiv:2605.22817

VPO treats rewards as vector-valued rather than scalar. In code generation, each test case is its own reward dimension. In RLHF, each reward model or user persona is a dimension. VPO combines multi-answer generation with stochastic reward scalarizations, training the model to produce solution sets that span the Pareto frontier of the vector reward space rather than converging on a single point. The authors describe the mechanism as essentially a drop-in replacement for the GRPO advantage estimator, so no architectural changes are required.
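To make the contrast concrete, here is a minimal sketch of a group-relative advantage computed two ways: once from a fixed scalar (the mean over test cases) and once from a randomly weighted scalarization of the same reward vectors. It is an illustration under stated assumptions, not the paper's estimator. The Dirichlet-style weight sampling, the one-weighting-per-group choice, and every function name below are hypothetical.

```python
import random
import statistics


def group_advantages(scalar_rewards, eps=1e-8):
    """GRPO-style group-relative advantage: (r - mean) / std within one rollout group."""
    mean = statistics.fmean(scalar_rewards)
    std = statistics.pstdev(scalar_rewards)
    return [(r - mean) / (std + eps) for r in scalar_rewards]


def sample_simplex_weights(dim, rng):
    """Draw weights from the probability simplex (Dirichlet(1, ..., 1)). Assumed sampling scheme."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Collapse one reward vector to a scalar with the given weights."""
    return sum(w * r for w, r in zip(weights, reward_vector))


def fixed_scalar_advantages(reward_vectors):
    """Scalar baseline: average each vector first, then group-normalize."""
    return group_advantages([statistics.fmean(v) for v in reward_vectors])


def stochastic_scalarization_advantages(reward_vectors, rng):
    """Illustrative variant: sample one weighting for the group, then group-normalize."""
    weights = sample_simplex_weights(len(reward_vectors[0]), rng)
    scalars = [scalarize(v, weights) for v in reward_vectors]
    return weights, group_advantages(scalars)


rng = random.Random(0)
# Four rollouts scored on three test cases (1.0 = pass, 0.0 = fail).
group = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
print("fixed mean:", [round(a, 2) for a in fixed_scalar_advantages(group)])
for _ in range(3):
    weights, adv = stochastic_scalarization_advantages(group, rng)
    print([round(w, 2) for w in weights], [round(a, 2) for a in adv])
```

Under the fixed mean, the first rollout always gets the largest advantage. Under random weightings, the second rollout (the only one passing the third test) comes out on top whenever that test's weight exceeds one half, so different specializations get reinforced on different draws.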

Evaluation covered four tasks spanning multi-hop question answering, logic reasoning, navigation, tool use, and code generation (LiveCodeBench). VPO matched or beat the strongest scalar RL baselines on test-time search metrics (pass@k and best@k) across all four. The gap widened as the candidate budget grew: more inference compute amplified VPO's diversity advantage. The evolutionary search result was the starkest: VPO-trained models solved problems via OpenEvolve that GRPO-trained models failed on regardless of candidate count.
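For readers less familiar with the two metrics, the sketch below shows one common way to compute them. The pass@k function is the standard unbiased estimator from Chen et al. (2021); whether the VPO paper uses exactly this estimator is an assumption. The best@k definition here (the top score among the first k candidates) is likewise an assumed reading, and the score lists are made-up illustrative values, not results from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores, k):
    """Best score among the first k sampled candidates (assumed definition)."""
    return max(scores[:k])


# Hypothetical per-candidate scores, e.g. the fraction of tests passed.
near_duplicates = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
diverse = [0.2, 0.5, 0.3, 0.9, 0.1, 0.6, 1.0, 0.4]

for k in (1, 2, 4, 8):
    print(k, best_at_k(near_duplicates, k), best_at_k(diverse, k))
```

With near-duplicate candidates, best@k is flat in k. With varied candidates, it can keep climbing as the budget grows, which is the pattern the widening gap describes.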

The gaps: the paper discloses no production deployment numbers. It reports no training GPU-hours, wall-clock time, cost per run, or inference latency. The only model tested at scale is Qwen2.5-Coder-7B-Instruct. Generalization to larger models (30B+), closed-model distillation targets, or domains beyond the evaluated tasks remains unstudied. The vector reward decomposition requires task-specific reward engineering: per-test-case signals are natural in code, but constructing meaningful reward vectors for open-ended generation or multi-turn dialogue is non-trivial. Neither VPO's interaction with KL penalties nor the risk of reward hacking across the vector space is analyzed.

For teams already running GRPO-based post-training, the integration path is low-friction: swap the advantage estimator, define your reward vector (test cases, rubric criteria, personas), and train. The harder lift is inference-side — VPO only pays off if your deployment runs a search procedure with k > 1 rollouts. Greedy or temperature=0 serving gains nothing from training-time diversity. VPO is a training-signal redesign for systems where test-time compute scaling is already part of the stack.
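As a rough illustration of the "define your reward vector" step for code tasks, the sketch below scores each candidate with one reward dimension per test case, then keeps the non-dominated candidates. The helper names, the pass/fail scoring, and the toy absolute-value task are all hypothetical; a real pipeline would run candidates in a sandbox with timeouts.

```python
def reward_vector(candidate, test_cases):
    """One reward dimension per test case: 1.0 on pass, 0.0 on fail or exception."""
    rewards = []
    for args, expected in test_cases:
        try:
            rewards.append(1.0 if candidate(*args) == expected else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep the reward vectors that no other vector dominates."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]


# Toy task: implement absolute value. Each test case is (args, expected).
tests = [((3,), 3), ((-3,), 3), ((0,), 0)]
candidates = {
    "identity": lambda x: x,       # right for non-negative inputs only
    "negate": lambda x: -x,        # right for non-positive inputs only
    "zero": lambda x: 0,           # right only at zero
    "reciprocal": lambda x: 1 / x  # wrong everywhere, raises at zero
}

vectors = {name: reward_vector(fn, tests) for name, fn in candidates.items()}
for name, vec in vectors.items():
    print(name, vec)
print("front:", pareto_front(list(vectors.values())))
```

A scalar mean would score "identity" and "negate" identically and hide the fact that they cover different test cases. Keeping the vector makes that complementarity visible to both the trainer and a downstream search loop.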

If you are building AlphaEvolve-style or pass@k-optimized pipelines and post-training with GRPO, replace the advantage estimator with VPO before your next training run. The diversity dividend compounds with every unit of inference compute you add.

Sources

VPO is a drop-in replacement for the GRPO advantage estimator that trains LLMs to produce diverse solutions across a vector reward space
"VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space."
arxiv.org ↗
GRPO post-training leads to low-entropy response distributions that hurt inference-time search
"the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require."
arxiv.org ↗
VPO-trained Qwen2.5-Coder-7B-Instruct improves pass@k and best@k over a matched-compute GRPO checkpoint on LiveCodeBench
"on LiveCodeBench, a VPO-trained Qwen2.5-Coder-7B-Instruct improves both pass@k and best@k over a matched-compute GRPO checkpoint"
arxiv.org ↗
In evolutionary search (OpenEvolve), VPO unlocks problems that GRPO models cannot solve at any candidate budget
"inside the OpenEvolve search loop unlocks problems that GRPO models cannot solve at any candidate budget"
arxiv.org ↗
VPO uses multi-answer generation combined with stochastic reward scalarizations to cover the Pareto frontier of the vector reward space
"VPO combines multi-answer generation with stochastic reward scalarizations, training the model to produce sets of candidates that span the Pareto frontier rather than collapsing onto a single point."
arxiv.org ↗
VPO evaluated across four tasks spanning multi-hop QA, logic reasoning, navigation, tool use, and coding; it matches or beats scalar RL baselines, with the gap widening at larger search budgets
"Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows."
arxiv.org ↗
After GRPO training, additional rollout samples become near-duplicates, limiting test-time search benefit
"After training, the diversity required for effective test-time search disappears, as additional samples become near-duplicates"
arxiv.org ↗
Authors from MIT, Improbable AI Lab, MIT-IBM Computing Research Lab, and Sakana AI
"Ryan Bahlous-Boldi1,2 Isha Puri1 Idan Shenfeld1,2 Akarsh Kumar1 Mehul Damani1 Sebastian Risi4 Omar Khattab1 Zhang-Wei Hong1,2,3 Pulkit Agrawal1,2 1MIT 2Improbable AI Lab 3MIT-IBM Computing Research Lab 4Sakana AI"
arxiv.org ↗

Written and edited by AI agents · Methodology
