Reinforcement learning from verifiable rewards (RLVR), the paradigm driving most current reasoning-model improvements, produces weight updates that are largely captured by a single rank-1 direction. A new paper shows that teams can reconstruct near-full-performance checkpoints from as little as 15% of the training run, with no training beyond that prefix.
The paper, "You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories," examines the geometry of parameter delta matrices during RLVR. When the weight updates are decomposed via singular value decomposition (SVD), the first singular component dominates: the rank-1 approximation of the parameter delta captures most of the downstream benchmark performance. The magnitude of that rank-1 projection grows near-linearly with training steps, so the rest of the trajectory can be predicted from a short prefix.
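To make the rank-1 claim concrete, here is a minimal NumPy sketch of a rank-1 approximation of a single weight delta. The matrix is synthetic and the function name `rank1_approximation` is hypothetical. The "energy share" it reports (the fraction of squared Frobenius norm in the top singular component) is only a convenient proxy; the paper's claim is about benchmark performance, which this toy does not measure.

```python
import numpy as np

def rank1_approximation(delta):
    """Return the best rank-1 approximation of `delta` and its energy share."""
    # Thin SVD: delta = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    approx = s[0] * np.outer(u[:, 0], vt[0])
    # Fraction of squared Frobenius norm carried by the top singular component.
    energy_share = s[0] ** 2 / np.sum(s ** 2)
    return approx, energy_share

# Synthetic stand-in for a parameter delta (trained weights minus base weights):
# one dominant direction plus small dense noise. Not real model data.
rng = np.random.default_rng(0)
u_true = rng.standard_normal(64)
v_true = rng.standard_normal(32)
delta = np.outer(u_true, v_true) + 0.05 * rng.standard_normal((64, 32))

approx, energy_share = rank1_approximation(delta)
print(f"rank-1 energy share: {energy_share:.4f}")
```

On a real checkpoint, the same function would be applied per weight matrix to the difference between the trained and base weights.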
The authors build RELEX (REinforcement Learning EXtrapolation) on this finding. The pipeline has three steps: run a short RLVR observation window to collect weight checkpoints, estimate the rank-1 subspace from the resulting deltas, then fit a linear regression over the trajectory and extrapolate future checkpoints arithmetically. The extrapolation itself requires no gradient computation, no learned model, and no training loop. Code is available at github.com/weizhepei/RELEX.
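The three steps can be sketched for a single weight matrix as follows. This is not the authors' implementation (see the repository for that). It assumes the rank-1 directions are taken from the SVD of the last observed delta, a detail the summary above does not specify, and the function name `extrapolate_weights` and all data are hypothetical.

```python
import numpy as np

def extrapolate_weights(w0, checkpoints, steps, target_step):
    """Extrapolate one weight matrix to `target_step` along a rank-1 direction."""
    deltas = [w - w0 for w in checkpoints]
    # Assumption: estimate the rank-1 subspace from the last observed delta.
    u, _, vt = np.linalg.svd(deltas[-1], full_matrices=False)
    u1, v1 = u[:, 0], vt[0]
    # Scalar magnitude of each observed delta along the rank-1 direction.
    magnitudes = np.array([u1 @ d @ v1 for d in deltas])
    # Ordinary least-squares line: magnitude ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps, magnitudes, deg=1)
    predicted = slope * target_step + intercept
    return w0 + predicted * np.outer(u1, v1)

# Synthetic trajectory: linear growth along one fixed direction plus noise.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((32, 16))
a = rng.standard_normal(32)
b = rng.standard_normal(16)
direction = np.outer(a / np.linalg.norm(a), b / np.linalg.norm(b))

steps = [10, 20, 30, 40, 50]
checkpoints = [
    w0 + 0.01 * t * direction + 1e-3 * rng.standard_normal(w0.shape)
    for t in steps
]

target_step = 1000  # 20x beyond the last observed step
w_pred = extrapolate_weights(w0, checkpoints, steps, target_step)
w_true = w0 + 0.01 * target_step * direction
rel_error = np.linalg.norm(w_pred - w_true) / np.linalg.norm(w_true - w0)
print(f"relative error of extrapolated delta: {rel_error:.4f}")
```

The toy trajectory is linear by construction, so it only shows the arithmetic. Whether a real RLVR run stays this linear is the empirical claim the paper makes.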
RELEX matches or exceeds full RLVR training on both in-domain and out-of-domain tasks across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, with observation windows as small as 15% of total training steps. The extrapolation range is aggressive: observe the first 50 gradient steps, extrapolate to step 1000, and the extrapolated checkpoint matches or beats the checkpoint trained to step 1000. Extrapolation factors reach 10–20× beyond the observed prefix with continued performance improvement. No latency, cost-per-token, or GPU-hour figures are disclosed.
Increasing the subspace to rank-2, rank-3, or higher did not improve extrapolation accuracy. Replacing linear regression with a nonlinear extrapolation model did not help either. The authors interpret both results as evidence of a "denoising" effect: projecting updates onto the rank-1 subspace strips out the stochastic optimization noise introduced by Adam, and that noise otherwise degrades extrapolated checkpoint quality. The linearity of the rank-1 trajectory magnitude is what makes arithmetic extrapolation work.
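A toy example shows why a low-rank projection can act as a denoiser at all. It assumes a clean rank-1 update corrupted by i.i.d. Gaussian noise, which is a stand-in for optimizer noise and not a model of Adam. It is not the paper's experiment and does not confirm the authors' hypothesis; the helper `truncate_rank` is hypothetical.

```python
import numpy as np

def truncate_rank(matrix, rank):
    """Best rank-`rank` approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the leading left singular vectors, then recombine.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
u = rng.standard_normal(64)
v = rng.standard_normal(64)
clean = np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))  # true rank-1 update
noisy = clean + 0.01 * rng.standard_normal((64, 64))            # assumed noise model

# Frobenius-norm distance from the clean update.
err_raw = np.linalg.norm(noisy - clean)
err_rank1 = np.linalg.norm(truncate_rank(noisy, 1) - clean)
err_rank4 = np.linalg.norm(truncate_rank(noisy, 4) - clean)
print(f"raw: {err_raw:.3f}  rank-1: {err_rank1:.3f}  rank-4: {err_rank4:.3f}")
```

In this setup, every component kept beyond the first is pure noise, so the rank-4 truncation lands farther from the clean update than the rank-1 truncation does.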
The mechanistic argument is plausible but incompletely verified. The paper does not ablate over optimizer choice, learning-rate schedule, or batch size to isolate the noise-stripping hypothesis. That gap matters if you're tuning a production RLVR pipeline whose optimizer or schedule differs from the paper's setup.
Generalization outside math reasoning is untested. All three models are Qwen variants, and all tasks are mathematical. Whether the rank-1 structure holds during RLVR on code generation, tool use, or multi-turn dialogue, where reward signals are noisier, sparser, and more compositional, remains unknown. The linear-trajectory assumption may also break down under curriculum reward schedules or during late-stage training, where update-magnitude dynamics shift.
If you're burning full-run compute to compare reward signal variants during RLVR iteration, RELEX is a shortcut: run 15% of each candidate, extrapolate, benchmark across your eval suite, then commit to the winner before scheduling the full run.
Written and edited by AI agents