Researchers at Sea AI Lab have introduced Divergence Regularized Policy Optimization (DRPO), a modification to PPO and GRPO that replaces ratio-based clipping with an advantage-weighted quadratic regularizer on policy shift. The change targets instability in large-scale LLM post-training, where per-token probability ratios are noisy single-sample Monte Carlo estimates of the true divergence between policies, particularly in sparse regions of the vocabulary.
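To see why a single ratio is a noisy estimate, note that under the old policy the expected value of |ratio - 1| equals twice the total-variation distance, so one sampled token yields only a one-sample estimate of it. The sketch below illustrates this on a made-up three-token vocabulary; the probabilities and helper names are illustrative assumptions, not taken from the paper.

```python
# Toy three-token vocabulary; all numbers are illustrative assumptions.
p_old = [0.900, 0.099, 0.001]   # old (rollout) policy
p_new = [0.898, 0.099, 0.003]   # new policy after a small update


def total_variation(p, q):
    """Exact TV distance: half the L1 distance between the two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def single_sample_estimate(p, q, token):
    """One-sample estimate of TV built from the sampled token's ratio alone."""
    ratio = q[token] / p[token]
    return 0.5 * abs(ratio - 1.0)


tv_exact = total_variation(p_old, p_new)
estimates = [single_sample_estimate(p_old, p_new, t) for t in range(3)]

# Weighting each estimate by its old-policy probability recovers the exact value,
# which is what makes the per-token quantity a Monte Carlo estimate.
tv_from_expectation = sum(p * e for p, e in zip(p_old, estimates))

print(tv_exact, estimates, tv_from_expectation)
```

In this toy case the exact distance is about 0.002, yet the rare token alone gives an estimate near 1.0 and a ratio near 3, far outside a typical clip range such as [0.8, 1.2].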
In their previous work on DPPO, the team identified a failure mode on Qwen3-30B-A3B-Base: PPO's clipped surrogate over-penalizes low-probability tokens, slowing learning on rare but critical vocabulary, while under-constraining shifts in high-probability mass. DPPO addressed this by replacing the ratio test with a divergence-based mask, derived from total-variation or KL divergence, on the sampled token's absolute probability shift. DRPO keeps the same trust-region geometry but swaps the hard mask for a smooth quadratic regularizer, which yields bounded, continuous gradient weights that attenuate diverging updates rather than discarding them.
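The contrast between the three gating schemes can be sketched as per-token gradient weights. This is a minimal illustration under stated assumptions, not the authors' formulation: the article does not give DRPO's exact functional form, so the sketch assumes a per-token objective A * p_new / p_old - beta * |A| * (p_new - p_old)^2, whose derivative with respect to p_new is (A / p_old) times the weight computed in `quadratic_weight`. The thresholds `eps`, `delta`, and `beta`, and all function names, are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def ppo_clip_weight(p_old, p_new, adv, eps=0.2):
    """Standard PPO clipping: the gradient is zeroed once the ratio leaves
    the clip range in the direction the advantage is pushing."""
    ratio = p_new / p_old
    if adv > 0 and ratio > 1.0 + eps:
        return 0.0
    if adv < 0 and ratio < 1.0 - eps:
        return 0.0
    return 1.0


def hard_mask_weight(p_old, p_new, adv, delta=0.1):
    """Assumed DPPO-style mask: drop the token when its absolute probability
    shift exceeds delta in the direction the advantage is pushing."""
    shift = p_new - p_old
    return 0.0 if sign(adv) * shift > delta else 1.0


def quadratic_weight(p_old, p_new, adv, beta=2.0):
    """Assumed DRPO-style weight: factor multiplying A / p_old in the gradient
    of A * p_new / p_old - beta * |A| * (p_new - p_old) ** 2."""
    shift = p_new - p_old
    return 1.0 - 2.0 * beta * p_old * sign(adv) * shift


# Rare token nudged upward with a positive advantage.
rare = (0.001, 0.003, 1.0)
# Common token pushed downward with a negative advantage.
common = (0.9, 0.75, -1.0)

for name, args in (("rare", rare), ("common", common)):
    print(name,
          ppo_clip_weight(*args),
          hard_mask_weight(*args),
          round(quadratic_weight(*args), 6))
```

Under these assumptions, ratio clipping zeroes the rare token (ratio near 3) but lets the common token through (ratio near 0.83), the hard mask does the opposite, and the quadratic form keeps the rare token at nearly full weight while scaling the common token down to roughly 0.46 instead of dropping it. Because probabilities lie in [0, 1], the assumed weight stays within a bounded range for any fixed `beta`.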
A DPPO implementation is available in the verl framework, using Megatron-LM training backends paired with vLLM or SGLang for inference rollouts, with training on the DAPO-MATH dataset and evaluation on AIME24. DRPO, from the same authors, is expected to follow a similar integration path, although no code URL has been provided yet. The authors traced nearly all training instability to a small subset of updates on negative samples, which the divergence mask targets.
There is no production evidence for DRPO at scale. The paper reports experiments across model scales, architectures, and precision settings, but it omits wall-clock step-time overhead, cluster-dollar comparisons against vanilla GRPO, and convergence curves on dense 70B-plus models. Until these numbers are available, architects should treat DRPO as a loss-function design pattern rather than a proven training recipe. What is clear is that the off-policy gap in modern LLM RL is structural and unavoidable: backend precision differences, policy staleness across rollout workers, and accumulated off-policy mismatch all contribute to divergence.
The debugging surface also changes with DRPO. A hard mask makes a visible keep-or-drop decision per token, whereas smooth attenuation hides that decision behind continuous gradient weights, so it is difficult to distinguish healthy suppression from a stuck policy without explicit logging of the regularizer term. Because DRPO has no public release, teams that want to try it must port the quadratic regularizer into verl themselves.
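One way to make the regularizer observable is to log a few summary statistics per batch. The sketch below is a hypothetical helper, not part of verl or of any released DRPO code. It assumes a per-token penalty of beta * |A| * (p_new - p_old)^2 and the matching gradient weight 1 - 2 * beta * p_old * sign(A) * (p_new - p_old); the metric names and the `attenuated_below` threshold are likewise illustrative.

```python
from statistics import mean


def regularizer_stats(p_old, p_new, adv, beta=2.0, attenuated_below=0.5):
    """Summarize an assumed quadratic regularizer over a batch of sampled tokens.

    p_old, p_new: old and new probabilities of each sampled token.
    adv: per-token advantages.
    """
    shifts = [n - o for o, n in zip(p_old, p_new)]
    # Assumed penalty term: beta * |A| * shift^2 per token.
    penalties = [beta * abs(a) * s * s for a, s in zip(adv, shifts)]
    # Assumed gradient weight implied by that penalty.
    weights = [
        1.0 - 2.0 * beta * o * ((a > 0) - (a < 0)) * s
        for o, a, s in zip(p_old, adv, shifts)
    ]
    return {
        "penalty_mean": mean(penalties),
        "shift_abs_max": max(abs(s) for s in shifts),
        "weight_min": min(weights),
        "frac_attenuated": sum(w < attenuated_below for w in weights) / len(weights),
    }


# Toy batch: a rare token, a heavily shifted common token, an unchanged token.
stats = regularizer_stats(
    p_old=[0.001, 0.9, 0.5],
    p_new=[0.003, 0.75, 0.5],
    adv=[1.0, -1.0, 1.0],
)
for key, value in stats.items():
    print(f"{key}: {value:.6f}")
```

Tracked over training steps, a persistently high `frac_attenuated` alongside a flat reward would point toward a stuck policy, while occasional spikes that decay are closer to what healthy suppression might look like. These interpretations are heuristics for this sketch rather than findings from the paper.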
Replace your ratio-clipping surrogate loss with a divergence-weighted quadratic penalty on per-token policy shift to recover updates that a hard mask would kill and preserve gradient flow where binary gating would leave none.
Written and edited by AI agents