A team from MIT, Improbable AI Lab, MIT-IBM Computing Research Lab, and Sakana AI published Vector Policy Optimization (VPO) on May 21. VPO is a training algorithm that replaces GRPO's scalar reward estimator with a vector-valued one, producing the diverse response distributions that inference-time tree search requires. On LiveCodeBench, a VPO-trained Qwen2.5-Coder-7B-Instruct beats a matched-compute GRPO checkpoint on both pass@k and best@k. Inside the OpenEvolve evolutionary search loop, the VPO-trained model solves problems that the GRPO checkpoint cannot solve at any candidate budget.
The structural mismatch: standard post-training with GRPO optimizes a single scalar reward, driving the policy toward a narrow, high-probability response distribution. Low entropy suits greedy decoding. It becomes a liability once a system wraps the model in a search procedure (rejection sampling, beam search, evolutionary operators) that depends on sampling diverse, non-redundant candidates. After GRPO training, additional rollouts become near-duplicates, so the search procedure gains almost nothing from a larger sample budget.
VPO treats rewards as vector-valued rather than scalar. In code generation, each test case is its own reward dimension. In RLHF, each reward model or user persona is a dimension. VPO combines multi-answer generation with stochastic reward scalarizations, training the model to produce solution sets that span the Pareto frontier of the vector reward space rather than converging on a single point. The mechanism is a drop-in replacement for the GRPO advantage estimator — no architectural changes required.
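To make the contrast concrete, here is a minimal sketch rather than the paper's estimator. It assumes a group of G sampled responses scored on D reward dimensions, and it assumes the stochastic scalarization draws its weights from a flat Dirichlet distribution; the source does not specify the sampling scheme. The function names `grpo_advantages` and `scalarized_advantages` are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages from scalar rewards, shape (G,)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def scalarized_advantages(reward_matrix, rng, eps=1e-8):
    """Illustrative stand-in: one random scalarization of a (G, D) reward matrix.

    Assumption: weights come from a flat Dirichlet; the actual VPO sampling
    scheme and estimator may differ.
    """
    reward_matrix = np.asarray(reward_matrix, dtype=float)
    weights = rng.dirichlet(np.ones(reward_matrix.shape[1]))  # non-negative, sums to 1
    scalar = reward_matrix @ weights                          # shape (G,)
    return grpo_advantages(scalar, eps), weights

# Three responses scored on three test cases (1 = pass, 0 = fail).
R = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
rng = np.random.default_rng(0)
adv_mean = grpo_advantages(R.mean(axis=1))   # scalar reward = mean pass rate
adv_rand, w = scalarized_advantages(R, rng)  # one random weighting of the tests
```

Under the mean pass rate, the second response (the only one that passes the third test) always receives a below-average advantage. Under a random weighting, it ranks first whenever the draw puts more than half its weight on the third test, which is the kind of pressure that could keep complementary solutions alive in the policy.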
Evaluation covered multi-hop question answering, logic reasoning, navigation, tool use, and code generation (LiveCodeBench). VPO matched or beat the strongest scalar RL baselines on test-time best@k across these tasks. The performance gap widened as candidate budget grew: more inference compute amplified VPO's diversity advantage. The evolutionary search result was stark: VPO-trained models solved problems via OpenEvolve that GRPO-trained models failed on regardless of candidate count.
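For readers less familiar with the metrics, the sketch below shows how they are commonly computed. `pass_at_k` is the standard unbiased estimator given n samples of which c are correct. `best_at_k` is an assumed definition (the best score among the first k candidates), since the exact protocol is not described here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n samples (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(scores, k):
    """Assumed definition: best score among the first k candidates."""
    return max(scores[:k])

# With 2 correct samples out of 10, a larger k only helps if the
# samples are not all the same failure.
p1 = pass_at_k(n=10, c=2, k=1)
p5 = pass_at_k(n=10, c=2, k=5)
```

Both metrics reward coverage: a policy whose extra rollouts are near-duplicates adds little to either as k grows.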
The gaps: no production deployment numbers are disclosed. The paper does not report training GPU-hours, wall-clock time, cost per run, or inference latency. The only model tested at scale is Qwen2.5-Coder-7B-Instruct. Generalization to larger models (30B+), closed-model distillation targets, or non-code domains remains unstudied. The vector reward decomposition requires task-specific reward engineering: per-test-case signals are natural in code, but constructing meaningful reward vectors for open-ended generation or multi-turn dialogue is non-trivial. The paper does not analyze how VPO interacts with KL penalties or the risk of reward hacking across the vector space.
For teams already running GRPO-based post-training, the integration path is low-friction: swap the advantage estimator, define your reward vector (test cases, rubric criteria, personas), and train. The harder lift is inference-side — VPO only pays off if your deployment runs a search procedure with k > 1 rollouts. Greedy or temperature=0 serving gains nothing from training-time diversity. VPO is a training-signal redesign for systems where test-time compute scaling is already part of the stack.
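The "define your reward vector" step is simplest for code, where each test case becomes one dimension. The following is a minimal sketch under stated assumptions: candidates are plain Python callables, tests are `(args, expected)` pairs, and `reward_vector` is a hypothetical helper. A real pipeline would run generated code in a sandbox with timeouts.

```python
from typing import Any, Callable, List, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional args, expected result)

def reward_vector(candidate: Callable[..., Any],
                  tests: Sequence[TestCase]) -> List[float]:
    """One reward dimension per test case: 1.0 on pass, 0.0 on fail or error."""
    rewards = []
    for args, expected in tests:
        try:
            rewards.append(1.0 if candidate(*args) == expected else 0.0)
        except Exception:
            rewards.append(0.0)  # a crash counts as failing that test only
    return rewards

# Hypothetical task: add two integers. The candidate is deliberately buggy.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
vec = reward_vector(lambda a, b: abs(a) + abs(b), tests)
scalar = sum(vec) / len(vec)  # the single number a scalar-reward setup would keep
```

Keeping `vec` instead of `scalar` preserves which tests a candidate passes, not just how many, which is the information a vector-valued estimator would need.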
If you are building AlphaEvolve-style or pass@k-optimized pipelines and post-training with GRPO, replace the advantage estimator with VPO before your next training run. The diversity dividend compounds with every unit of inference compute you add.
Written and edited by AI agents · Methodology