A CPU-offload optimizer bug in DeepSpeed has been silently corrupting reinforcement learning fine-tuning pipelines across TRL, OpenRLHF, and Llama-Factory — three of the most widely deployed open-source RLHF frameworks — invalidating published benchmark comparisons and reversing research conclusions that favored mixed-policy training over standard SFT-then-RL approaches.
The finding comes from an arXiv paper, "SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning," by Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, and Valentina Pyatkin. The paper identifies two distinct bugs. The primary defect is a DeepSpeed CPU-offloaded optimizer bug that silently drops intermediate micro-batches during gradient accumulation — models trained with this configuration never receive their full gradient signal. A second, smaller defect is a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Both suppress SFT performance without raising errors or warnings, making them nearly impossible to detect without controlled comparison.
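To make the failure modes concrete, the PyTorch-style sketch below shows what correct gradient accumulation and correct loss weighting look like. The toy model, data, and variable names are illustrative stand-ins, not the frameworks' actual code, and the exact defects in DeepSpeed and OpenRLHF may differ in their details.

```python
import torch
from torch import nn

# Toy stand-ins: real pipelines use an LLM and tokenized text batches.
torch.manual_seed(0)
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
micro_batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(8)]
accum_steps = len(micro_batches)

# Correct gradient accumulation: every micro-batch's backward() adds into
# .grad, and the single optimizer.step() applies the full accumulated sum.
optimizer.zero_grad()
for x, y in micro_batches:
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so the accumulated gradient is a mean
optimizer.step()

# The DeepSpeed defect described in the paper behaves as if some of these
# backward() contributions never reach the optimizer step when CPU offload
# is active, so the model trains on an incomplete gradient while the loss
# curve still looks plausible.

# Loss aggregation: averaging per-mini-batch mean losses weights every
# mini-batch equally even when their sizes differ; weighting by element
# count recovers the true mean. (This is one common form of mis-weighting;
# the exact OpenRLHF defect may differ.)
losses = [nn.functional.mse_loss(model(x), y) for x, y in micro_batches]
sizes = [x.shape[0] for x, _ in micro_batches]
naive_mean = sum(losses) / len(losses)                                  # size-blind
weighted_mean = sum(l * n for l, n in zip(losses, sizes)) / sum(sizes)  # correct
```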
The DeepSpeed optimizer bug accounts for most of the damage. Because it sits at the infrastructure layer — inside the optimizer state handling triggered when CPU offloading is active — it propagates into any framework that wraps DeepSpeed with CPU offload enabled. TRL, OpenRLHF, and Llama-Factory all fall into that category, meaning any benchmark result produced on default or common configurations of these frameworks should be treated as potentially compromised.
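For reference, the snippet below shows a representative DeepSpeed ZeRO configuration with optimizer CPU offload enabled, the setting that activates the affected code path. The surrounding values are illustrative, not copied from any framework's default file.

```python
# Representative DeepSpeed config dict with optimizer state offloaded to CPU.
# Frameworks such as TRL, OpenRLHF, and Llama-Factory pass a config like this
# through to DeepSpeed; the specific batch sizes and dtype here are examples.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```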
The practical consequence is a systematic mischaracterization of the SFT-then-RL baseline. Numerous published papers reported that mixed-policy methods — which interleave or blend supervised and reinforcement learning signals — outperformed the standard sequential pipeline. Once the bugs are corrected, the authors find the opposite: a clean SFT-then-RL pipeline surpasses every mixed-policy method they evaluate by +3.8 points on math benchmarks using Qwen2.5-Math-7B, and by +22.2 points using Llama-3.1-8B. A truncated SFT-then-RL variant running only 50 RL steps still beats mixed-policy methods on math benchmarks with lower total FLOPs.
For enterprise ML engineering teams, the immediate implication is an audit requirement. Any internal benchmark comparison run against a mixed-policy baseline using TRL, OpenRLHF, or Llama-Factory with DeepSpeed CPU offload enabled is suspect. Training jobs that appeared to converge correctly may have been learning from systematically incomplete gradient updates. The risk is not merely that leaderboard numbers are wrong — it is that architecture decisions made on top of those numbers (which framework to adopt, whether to invest in mixed-policy infrastructure, how to size compute budgets) were made on a broken foundation.
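A first pass at that audit can be as simple as scanning archived run configurations for the offload setting. The sketch below assumes DeepSpeed config JSONs live under a `runs/` directory; the path pattern is a placeholder for however a team actually archives its training configs.

```python
import glob
import json

# Flag archived runs whose DeepSpeed config enabled optimizer CPU offload.
# The "runs/**/ds_config*.json" pattern is an assumption; adapt it to your
# own experiment-tracking layout.
for path in glob.glob("runs/**/ds_config*.json", recursive=True):
    with open(path) as f:
        cfg = json.load(f)
    device = (
        cfg.get("zero_optimization", {})
           .get("offload_optimizer", {})
           .get("device", "none")
    )
    if device == "cpu":
        print(f"{path}: optimizer CPU offload was enabled; results are suspect")
```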
Remediation requires identifying whether CPU offload was active in past runs, applying patches or updated framework versions addressing the DeepSpeed gradient accumulation behavior, and re-running baseline evaluations with corrected configurations. The paper does not specify version numbers for the fixed code; teams should monitor upstream DeepSpeed, TRL, OpenRLHF, and Llama-Factory release notes for fixes and verify against controlled reference runs.
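One way to run such a controlled reference check, independent of any particular framework fix, is to confirm that gradients accumulated over micro-batches match the gradient of the same data processed as a single batch. The toy check below illustrates the idea; it is not the paper's verification procedure.

```python
import torch
from torch import nn

# Reference check: the gradient accumulated over equal-size micro-batches
# must match the gradient of the same examples processed as one batch.
# A mismatch here is the kind of silent defect the paper describes.
torch.manual_seed(0)
model = nn.Linear(16, 1)
data = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(4)]

# Accumulated path.
model.zero_grad()
for x, y in data:
    (nn.functional.mse_loss(model(x), y) / len(data)).backward()
accumulated = model.weight.grad.clone()

# Full-batch reference.
model.zero_grad()
x_all = torch.cat([x for x, _ in data])
y_all = torch.cat([y for _, y in data])
nn.functional.mse_loss(model(x_all), y_all).backward()
reference = model.weight.grad.clone()

assert torch.allclose(accumulated, reference, atol=1e-5), "micro-batch gradients were lost"
```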
The deeper issue is infrastructure-layer silent failure. Unlike a NaN loss or an obvious divergence, gradient accumulation bugs that drop micro-batches produce plausible-looking training curves — models learn, loss decreases, and nothing signals a problem. Published research that used these frameworks as baselines had no mechanism to detect the corruption. The authors' correction does not require new algorithms; it requires accurate measurement. The standard pipeline was winning all along — once measured correctly. That result validates the classical approach and shows how silently broken tooling can distort an entire subfield's trajectory.