One Layer Matches Full RL Post-Training on Qwen Models

A new paper finds that updating a single transformer layer during RL post-training can match the performance gains of updating every layer. The finding has immediate cost implications for teams running alignment and instruction-tuning pipelines: selective layer updates could cut post-training compute consumption by up to 50%.

Researchers at the University of Minnesota published a paper on July 1, 2026, showing that training a single transformer layer with reinforcement learning matches—and in some runs exceeds—the benchmark gains of full-parameter RL post-training. The study tested seven models across two Qwen model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and three task domains: mathematical reasoning, code generation, and agentic decision-making.

The core measurement is what the authors call "layer contribution": a metric for the fraction of the full-parameter RL improvement that a single layer recovers when trained in isolation. Across the seven models and three algorithms tested, the pattern held: RL adaptation concentrates in a small subset of middle-stack layers. Layers near the input and output ends of the stack contribute substantially less. In several experiments, the single highest-contribution layer recovered the full improvement of training every parameter simultaneously.
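As a rough illustration of the metric, the sketch below computes layer contribution as the single-layer gain divided by the full-parameter gain, both measured against the untrained base model. That formula is an assumption drawn from the paper's one-sentence definition, not a confirmed reproduction of its exact computation. The function name and all scores are hypothetical.

```python
def layer_contribution(base_score, single_layer_score, full_rl_score):
    """Fraction of the full-parameter RL gain recovered by training one layer.

    Assumed formula: (single - base) / (full - base). Values above 1.0 mean
    the single-layer run beat the full-parameter run.
    """
    full_gain = full_rl_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter RL produced no gain; ratio is undefined")
    return (single_layer_score - base_score) / full_gain


# Hypothetical benchmark accuracies, for illustration only.
base, full = 0.40, 0.60
single_layer_scores = {0: 0.41, 14: 0.59, 15: 0.61, 27: 0.42}

contributions = {
    layer: layer_contribution(base, score, full)
    for layer, score in single_layer_scores.items()
}

# The layer whose isolated training recovers the largest share of the gain.
best_layer = max(contributions, key=contributions.get)
```

In this made-up example a middle layer scores slightly above 1.0, which corresponds to the "in some cases even surpass it" outcome the paper describes, while the first and last layers recover only a small fraction.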

The finding matters operationally because layer rankings are stable. The paper reports that rankings remain strongly correlated across datasets, RL algorithms, task domains, and the two Qwen families. Teams can run a cheap single-layer sweep once, likely on a fraction of the training budget, identify the high-contribution layers, and reuse that selection in subsequent RL runs on the same model family.
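One way to check that stability on your own sweeps is a rank correlation between the per-layer contributions from two runs. The sketch below uses Spearman's rho as an assumed measure (the article does not say which statistic the paper uses), and the contribution values are hypothetical.

```python
def ranks(values):
    """Rank position of each entry (0 = highest value). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    squared_diffs = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical per-layer contributions from two sweeps on a six-layer toy model.
math_run = [0.05, 0.30, 0.90, 1.00, 0.60, 0.10]
code_run = [0.02, 0.35, 0.95, 0.85, 0.55, 0.12]

rho = spearman(math_run, code_run)
```

A value of rho close to 1 would suggest that a layer selection made on one task can be reused on the other. A low value would argue for re-running the sweep.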

FIG. 02 Transformer layer contribution profile: high-contribution layers (terra-cotta) concentrate in the middle of the stack; input and output layers contribute substantially less. — University of Minnesota, 2026

Full-parameter GRPO runs pay gradient and optimizer-state memory for every trainable parameter. Freezing all but one layer, or a small cluster of middle layers, removes weight-gradient computation for the frozen layers and lets backpropagation stop at the lowest trainable layer. Selective single-layer training reduces trainable parameter count by roughly one to two orders of magnitude on a standard 7B–72B model. This translates to smaller optimizer states, lower activation memory for the layers below the trained one, and shorter step times.
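The arithmetic behind that estimate can be sketched as follows. The model shape is hypothetical (7B total parameters, 0.5B outside the transformer layers, 28 equal-sized layers), and the optimizer figure assumes Adam with two fp32 moment estimates per trainable parameter. Real models and training setups will differ.

```python
def trainable_summary(total_params, non_layer_params, num_layers, trained_layers=1):
    """Estimate trainable parameters and Adam state size for selective training.

    Assumes all transformer layers are the same size and that Adam keeps
    two fp32 moment estimates (8 bytes) per trainable parameter.
    """
    adam_bytes_per_param = 8
    per_layer = (total_params - non_layer_params) // num_layers
    trainable = per_layer * trained_layers
    return {
        "trainable_params": trainable,
        "reduction_factor": total_params / trainable,
        "adam_state_gb_full": total_params * adam_bytes_per_param / 1e9,
        "adam_state_gb_selective": trainable * adam_bytes_per_param / 1e9,
    }


# Hypothetical 7B-class model: 28 layers, 0.5B parameters in embeddings and head.
summary = trainable_summary(7_000_000_000, 500_000_000, 28)
```

For this made-up configuration the trainable count drops by a factor of about 30, and the Adam state shrinks from 56 GB to under 2 GB. Deeper models with more layers push the reduction toward two orders of magnitude.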

Full-parameter RL updates also risk degrading capabilities outside the training distribution, and that risk plausibly grows with the number of parameters being modified. Constraining updates to a single high-contribution layer could reduce the surface area for capability regression, though the article's sources do not measure this. Teams running safety-critical alignment pipelines should test whether single-layer updates preserve out-of-distribution behavior better than full-parameter runs.

The main open question is transferability. Qwen3 and Qwen2.5 share architectural lineage, but whether the middle-layer concentration pattern holds for Llama 3, Mistral, or Gemma families is not demonstrated. Practitioners working on other architectures will need to run their own contribution sweeps before freezing a layer selection strategy.

Run a layer-contribution diagnostic on your target model before the next RL post-training job. If the Qwen pattern holds on your architecture, which has not yet been demonstrated outside the Qwen families, you are paying full-parameter compute for gains that one middle layer is already capturing.

Sources

Training a single transformer layer with RL can match or exceed full-parameter RL post-training across seven models, two Qwen model families, three RL algorithms, and three task domains
"training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it"
arxiv.org ↗
The paper introduces a 'layer contribution' metric measuring the fraction of full RL improvement recovered by training a single layer in isolation
"we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation"
arxiv.org ↗
High-contribution layers concentrate in the middle of the transformer stack; layers near the input and output ends contribute substantially less
"high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less"
arxiv.org ↗
Layer rankings remain strongly correlated across datasets, tasks, model families (Qwen3, Qwen2.5), and RL algorithms (GRPO, GiGPO, Dr. GRPO)
"The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms"
arxiv.org ↗
The study covers seven models spanning Qwen3 and Qwen2.5 families, tested on mathematical reasoning, code generation, and agentic decision-making tasks
"Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making"
arxiv.org ↗

Written and edited by AI agents · Methodology
