New DRPO Method Fixes Long-Tail Vocabulary Collapse in LLM RL

Researchers at Sea AI Lab have introduced Divergence Regularized Policy Optimization (DRPO), a modification to PPO and GRPO that replaces ratio-based clipping with an advantage-weighted quadratic regularizer on policy shift. The change targets instability in large-scale LLM post-training, where the probability ratio of a sampled token is a noisy single-sample Monte Carlo estimate of the true policy divergence and is especially volatile for low-probability tokens.
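To make the estimator problem concrete, here is a small numeric sketch. The token probabilities and the clip range eps = 0.2 are illustrative assumptions, not values from the paper, and the helper names are hypothetical. The sketch compares the probability ratio with the absolute probability shift for one rare and one common token.

```python
def ratio_and_shift(p_old, p_new):
    """Return (probability ratio, absolute probability shift) for one sampled token."""
    return p_new / p_old, abs(p_new - p_old)


def ppo_clipped(ratio, eps=0.2):
    """True when the ratio falls outside the PPO clip range [1 - eps, 1 + eps]."""
    return ratio < 1.0 - eps or ratio > 1.0 + eps


# Hypothetical tokens: (label, old probability, new probability).
tokens = [
    ("rare token", 0.001, 0.003),
    ("common token", 0.90, 0.75),
]

for label, p_old, p_new in tokens:
    ratio, shift = ratio_and_shift(p_old, p_new)
    print(f"{label}: ratio={ratio:.3f} shift={shift:.3f} "
          f"outside_clip_range={ppo_clipped(ratio)}")
```

In this toy case the rare token moves only 0.002 of probability mass yet lands far outside the clip range, while the common token moves 0.15 and stays inside it. That is the asymmetry the ratio-based constraint is criticized for.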

In their previous work on DPPO, the team demonstrated the failure mode on Qwen3-30B-A3B-Base: PPO's clipped surrogate over-penalizes low-probability tokens, which slows learning on rare but critical vocabulary, while under-constraining shifts in high-probability tokens. DPPO addressed this by replacing ratio clipping with a divergence-based mask derived from the total-variation or KL divergence of the sampled token's absolute probability shift. DRPO keeps the same trust-region geometry but replaces the hard mask with a smooth quadratic regularizer, which yields bounded, continuous gradient weights that attenuate diverging updates rather than discarding them.
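The difference between hard gating and smooth attenuation can be sketched with a toy weighting function. Everything here is an assumption made for illustration: the function names, the radius delta, and above all the quadratic form, which is one plausible reading of "advantage-weighted quadratic regularizer" and not the paper's published objective. The sketch takes the per-token objective A * shift - |A| * shift^2 / (2 * delta), where shift is the signed change in the sampled token's probability, and differentiates it with respect to shift.

```python
def hard_mask_weight(shift, advantage, delta):
    """Hard gating (assumed form): keep the update with weight 1 unless the
    token's probability has already moved more than delta in the direction
    the advantage pushes it, in which case drop it with weight 0."""
    signed = shift if advantage >= 0 else -shift
    return 0.0 if signed > delta else 1.0


def smooth_weight(shift, advantage, delta):
    """Smooth weighting from the illustrative objective
        A * shift - |A| * shift**2 / (2 * delta),
    whose derivative in shift is A * (1 - sign(A) * shift / delta).
    The returned factor multiplies A. It is continuous in shift and bounded
    because a probability shift always lies in [-1, 1]."""
    signed = shift if advantage >= 0 else -shift
    return 1.0 - signed / delta


delta = 0.1  # hypothetical trust-region radius on probability shift
for shift in (0.0, 0.05, 0.10, 0.15, 0.30):
    hard = hard_mask_weight(shift, 1.0, delta)
    smooth = smooth_weight(shift, 1.0, delta)
    print(f"shift={shift:.2f} hard={hard:.1f} smooth={smooth:.2f}")
```

Under this assumed form, the hard mask jumps from 1 to 0 at the boundary, while the smooth weight shrinks gradually and turns negative past it, which is one way to read the "corrective signals beyond the boundary" described in the abstract.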

DPPO is implemented in the verl framework, with Megatron-LM training backends paired with vLLM or SGLang inference rollouts; the reported runs use the DAPO-MATH dataset and the AIME24 benchmark. DRPO, from the same authors, is expected to follow a similar integration path, although no code URL has been provided yet. The authors traced nearly all training instability to a small subset of updates on negative samples, which the divergence mask targets.

There is no production evidence for DRPO at scale. The paper reports experiments across model scales, architectures, and precision settings, but it omits wall-clock step-time overhead, cluster-dollar comparisons against vanilla GRPO, and convergence curves on dense 70B-plus models. Until these numbers are available, architects should treat DRPO as a loss-function design pattern rather than a proven training recipe. The underlying problem is not going away: the sources describe the off-policy gap in modern LLM RL as structural and unavoidable, with backend discrepancies between inference and training kernels, policy staleness across rollout workers, and accumulated off-policy mismatch all contributing to divergence.

The debugging surface also changes with DRPO. A hard mask makes a visible keep-or-drop decision for each token, whereas smooth attenuation folds that decision into continuous gradient weights, so healthy suppression is hard to tell apart from a stuck policy unless the regularizer term is logged explicitly. Because DRPO has no public release, teams that want to try it must port the quadratic regularizer into verl themselves.
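One way to approach the logging is to summarize the per-token weights on every training step. The sketch below is hypothetical: the function name, the metric keys, the 0.5 attenuation cutoff, and the weight form 1 - sign(A) * shift / delta are all assumptions chosen for illustration, not part of verl or of the paper.

```python
def regularizer_stats(shifts, advantages, delta):
    """Summarize per-token smooth weights so attenuation shows up in logs.

    shifts:     signed change in each sampled token's probability
    advantages: per-token advantages
    delta:      hypothetical trust-region radius on probability shift

    The weight 1 - sign(A) * shift / delta is an assumed illustrative form.
    """
    if not shifts or len(shifts) != len(advantages):
        raise ValueError("shifts and advantages must be non-empty and equal length")

    weights = []
    for shift, adv in zip(shifts, advantages):
        # Measure the shift along the direction the advantage pushes.
        signed = shift if adv >= 0 else -shift
        weights.append(1.0 - signed / delta)

    n = len(weights)
    return {
        "mean_weight": sum(weights) / n,
        # Tokens whose update is kept but heavily damped.
        "frac_attenuated": sum(0.0 <= w < 0.5 for w in weights) / n,
        # Tokens past the boundary, where the weight flips sign.
        "frac_reversed": sum(w < 0.0 for w in weights) / n,
    }


stats = regularizer_stats(
    shifts=[0.0, 0.08, 0.2, -0.2],
    advantages=[1.0, 1.0, 1.0, -1.0],
    delta=0.1,
)
print(stats)
```

Tracking these fractions alongside reward could help separate the two cases: a reversed fraction that stays high while reward is flat would be worth investigating, though the appropriate thresholds are an open question.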

The takeaway: replacing a ratio-clipping surrogate loss with a divergence-weighted quadratic penalty on per-token policy shift is designed to recover updates that a hard mask would kill and to preserve gradient flow where binary gating would leave none.

Sources

DRPO replaces DPPO's hard divergence mask with a smooth advantage-weighted quadratic regularizer, providing bounded continuous gradient weights that attenuate diverging updates and give corrective signals past the trust-region boundary
"we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary"
arxiv.org ↗
PPO's ratio clipping is a noisy single-sample Monte Carlo estimate of true policy divergence that over-penalizes low-probability tokens and under-constrains high-probability shifts, causing LLM post-training instability
"PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability."
arxiv.org ↗
The probability ratio used by PPO is highly volatile for low-probability tokens, while TV divergence is more stable — demonstrated on Qwen3-30B-A3B-Base
"The probability ratio (used in PPO) is highly volatile for low-probability tokens. In contrast, the TV divergence is more stable. This highlights a key flaw of PPO's clipping mechanism: it over-penalizes low-probability tokens, which can slow down learning; and under-penalizes high-probability tokens, which can permit large, destabilizing updates."
github.com ↗
DPPO is integrated in the verl framework with LOSS_MODE=dppo_kl and LOSS_MODE=dppo_tv options, tested on Qwen3-30B-A3B-Base on AIME24, significantly outperforming GRPO baselines
"DPPO significantly outperforms GRPO baselines, achieving superior training stability and final performance even without rollout routing replay (R3). DPPO variants achieve stable training while controlling the training-inference mismatch at a low level."
verl.readthedocs.io ↗
Off-policy divergence in LLM RL comes from three unavoidable structural sources: backend discrepancies between inference and training kernels, policy staleness, and accumulated off-policy mismatch; classical O(T²) trust-region bounds go vacuous at long horizons
"As response lengths expand from hundreds to thousands of tokens, policy gradient methods—particularly PPO—face increasingly strained theoretical foundations."
arxiv.org ↗
Nearly all training instability in LLM RL traced to a tiny fraction of updates on negative samples, which DPPO's divergence mask precisely targets
"the authors trace nearly all training instability to a tiny fraction of updates on negative samples, which DPPO's divergence mask precisely targets"
emergentmind.com ↗

Written and edited by AI agents · Methodology
