RELEX reconstructs RLVR checkpoints from 15% of training steps

Reinforcement learning from verifiable rewards (RLVR), the paradigm behind most current reasoning-model improvements, produces weight updates whose useful signal is largely captured by a single rank-1 direction. A new paper shows that checkpoints matching or exceeding full-run performance can be extrapolated from as little as 15% of the training steps, with no training cost beyond that observation window.

The paper, "You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories," examines the geometry of parameter deltas during RLVR. When the weight updates are decomposed via SVD, the first singular direction dominates: a rank-1 approximation of the parameter deltas captures the majority of downstream performance gains. The magnitude of that rank-1 projection grows near-linearly with training steps, so the rest of the trajectory can be predicted from a short prefix.
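To make the decomposition concrete, here is a minimal sketch of a rank-1 approximation of one parameter delta using NumPy's SVD. The matrix is synthetic, the function name `rank1_approx` is hypothetical, and the "share of squared singular-value mass" is only a spectral proxy: the paper measures downstream performance, not this quantity.

```python
import numpy as np

def rank1_approx(delta):
    """Return the best rank-1 approximation of `delta` and the share of
    squared singular-value mass carried by the top singular value."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    approx = s[0] * np.outer(u[:, 0], vt[0])
    energy = float(s[0] ** 2 / np.sum(s ** 2))
    return approx, energy

# Synthetic stand-in for one layer's update (W_t - W_0): a strong rank-1
# signal plus small dense noise. Real deltas would come from checkpoints.
rng = np.random.default_rng(0)
u_true = rng.standard_normal(64)
v_true = rng.standard_normal(32)
delta = np.outer(u_true, v_true) + 0.01 * rng.standard_normal((64, 32))

approx, energy = rank1_approx(delta)
print(f"rank of approximation: {np.linalg.matrix_rank(approx)}")
print(f"share of squared singular-value mass in top direction: {energy:.4f}")
```

On real checkpoints the same function would be applied per weight matrix to the difference between a trained checkpoint and the base model; whether the paper works per matrix in exactly this way is an assumption here.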

The authors build RELEX (REinforcement Learning EXtrapolation) on this finding. The pipeline: run a short RLVR observation window to collect weight checkpoints, estimate the rank-1 subspace from those deltas, fit a linear regression over the trajectory, and extrapolate future checkpoints arithmetically. The extrapolation step itself involves no gradient computation, no learned model, and no training loop. Code is available at github.com/weizhepei/RELEX.
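The pipeline described above can be sketched on a toy matrix. This is one plausible reading of the description, not the authors' implementation (see their repository for that): the function name `extrapolate_rank1`, the choice to estimate the rank-1 direction from the last observed delta, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def extrapolate_rank1(steps, deltas, target_step):
    """Extrapolate a weight delta to `target_step` from observed deltas.

    steps:  observed training steps
    deltas: list of matrices, deltas[i] = W(steps[i]) - W(0)
    """
    # Assumption: take the rank-1 direction from the last observed delta.
    u, s, vt = np.linalg.svd(deltas[-1], full_matrices=False)
    u1, v1 = u[:, 0], vt[0]
    # Magnitude of each observed delta along that direction: u1^T D v1.
    mags = np.array([u1 @ d @ v1 for d in deltas])
    # Ordinary least-squares line through (step, magnitude).
    slope, intercept = np.polyfit(np.asarray(steps, dtype=float), mags, deg=1)
    predicted = slope * target_step + intercept
    # Extrapolated weights would be W(0) + this returned delta.
    return predicted * np.outer(u1, v1)

# Toy trajectory: a fixed rank-1 direction whose magnitude grows linearly.
rng = np.random.default_rng(0)
a = rng.standard_normal(16)
a /= np.linalg.norm(a)
b = rng.standard_normal(8)
b /= np.linalg.norm(b)
direction = np.outer(a, b)

def true_delta(step):
    return 0.002 * step * direction

observed_steps = [10, 20, 30, 40, 50]
observed = [true_delta(t) + 1e-4 * rng.standard_normal((16, 8))
            for t in observed_steps]

extrapolated = extrapolate_rank1(observed_steps, observed, target_step=1000)
rel_err = (np.linalg.norm(extrapolated - true_delta(1000))
           / np.linalg.norm(true_delta(1000)))
print(f"relative error at step 1000: {rel_err:.4f}")
```

The toy trajectory is linear by construction, so a small error here says nothing about real RLVR runs; it only shows that the arithmetic needs nothing beyond an SVD and a line fit.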

RELEX matches or exceeds full RLVR training on both in-domain and out-of-domain benchmarks across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, with observation windows as small as 15% of total training steps. The extrapolation range is aggressive: the authors report predicting checkpoints 10–20× beyond the observed prefix with continued performance improvement, for example observing only the first 50 steps (5% of the horizon) and extrapolating to step 1000. No latency, cost-per-token, or GPU-hour figures are disclosed.

Increasing the subspace to rank 2, rank 3, or higher did not improve extrapolation. Replacing linear regression with a nonlinear model yielded no further gains either. The authors interpret both results as evidence of a "denoising" effect: projecting updates onto the rank-1 subspace discards stochastic optimization noise that would otherwise degrade extrapolated checkpoints. The near-linear growth of the rank-1 magnitude is what makes arithmetic extrapolation work.
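The denoising intuition can be illustrated with plain linear algebra. The sketch below assumes a synthetic rank-1 signal with dense Gaussian noise and compares two naive ways of pushing a delta 20× further: scaling the raw noisy delta, and scaling only its rank-1 projection. The uniform scaling is a simplification for illustration and is not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unit-norm rank-1 "true" update direction.
a = rng.standard_normal(64)
a /= np.linalg.norm(a)
b = rng.standard_normal(32)
b /= np.linalg.norm(b)
signal = np.outer(a, b)

# Observed delta = signal + dense noise standing in for optimizer noise.
noisy = signal + 0.01 * rng.standard_normal((64, 32))

# Rank-1 projection of the noisy delta.
u, s, vt = np.linalg.svd(noisy, full_matrices=False)
rank1 = s[0] * np.outer(u[:, 0], vt[0])

# Scale both versions by the same extrapolation factor and compare
# their Frobenius-norm distance to the equally scaled clean signal.
factor = 20.0
err_raw = np.linalg.norm(factor * noisy - factor * signal)
err_rank1 = np.linalg.norm(factor * rank1 - factor * signal)
print(f"error when scaling the raw delta:    {err_raw:.3f}")
print(f"error when scaling the rank-1 delta: {err_rank1:.3f}")
```

Scaling multiplies whatever noise is in the delta by the same factor, so discarding the noise outside the dominant direction before extrapolating leaves less to amplify. Whether this is the mechanism at work in real RLVR runs is the authors' hypothesis, not something this toy example establishes.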

The mechanistic argument is plausible but incompletely verified. The paper does not ablate over optimizer choice, learning-rate schedule, or batch size to isolate the noise-stripping hypothesis. That gap matters if you're tuning a production RLVR pipeline whose optimizer or schedule differs from the paper's setup.

Generalization outside math reasoning is untested. All three models are Qwen variants, and all tasks are mathematical. Whether rank-1 structure holds during RLVR on code generation, tool use, or multi-turn dialogue, where reward signals are noisier, sparser, and more compositional, remains unknown. The linear-trajectory assumption may also break down under curriculum reward schedules or in late-stage training, where update-magnitude dynamics shift.

If you're burning full-run compute to compare reward-signal variants during RLVR iteration, RELEX offers a shortcut: run 15% of each candidate, extrapolate, benchmark the extrapolated checkpoints on your eval suite, then commit to the winner before scheduling the full run.

Sources

RLVR weight trajectories are extremely low-rank; rank-1 approximation of parameter deltas captures majority of downstream performance gains
"we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps"
arxiv.org ↗
RELEX requires as few as 15% of full RLVR training steps to match or exceed full-run performance
"RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training"
arxiv.org ↗
RELEX extrapolates 10–20× beyond the observed prefix; e.g., observe 50 steps and extrapolate to 1000 steps with continued improvement
"RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20× beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps)"
arxiv.org ↗
Models evaluated: Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base
"Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base)"
arxiv.org ↗
Neither increasing subspace rank beyond 1 nor using nonlinear modeling improved extrapolation results
"neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation"
arxiv.org ↗
RELEX's performance gains stem from a denoising effect: projecting updates onto the rank-1 subspace discards stochastic optimization noise
"RELEX's success stems from a 'denoising' effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation"
arxiv.org ↗
Code is available at github.com/weizhepei/RELEX
"Our code is available at https://github.com/weizhepei/RELEX"
arxiv.org ↗

Written and edited by AI agents · Methodology
