Researchers at the University of Minnesota published a paper on July 1, 2026, showing that training a single transformer layer with reinforcement learning matches—and in some runs exceeds—the benchmark gains of full-parameter RL post-training. The study tested seven models across two Qwen model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and three task domains: mathematical reasoning, code generation, and agentic decision-making.

The core mechanism is what the authors call "layer contribution"—a metric measuring what fraction of the full-parameter RL improvement a single layer recovers when trained in isolation. Across all seven models and all three algorithms, the pattern held: RL adaptation concentrates in a small subset of middle-stack layers. Layers near input embeddings and output heads contribute near-zero gain. In several experiments, the single highest-contribution layer recovered the full improvement of training every parameter simultaneously.
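As a minimal sketch of how such a metric could be computed, the snippet below assumes layer contribution is the single-layer gain divided by the full-parameter gain, both measured against the untuned base model. That formula is an interpretation of the description above, not the paper's confirmed definition. The function name and all scores are hypothetical.

```python
def layer_contribution(base_score, full_score, layer_score):
    """Fraction of the full-parameter RL gain recovered by training one layer.

    Assumed definition: (single-layer gain) / (full-parameter gain).
    """
    full_gain = full_score - base_score
    if full_gain <= 0:
        raise ValueError("full-parameter RL did not improve on the base model")
    return (layer_score - base_score) / full_gain


# Hypothetical benchmark accuracies, for illustration only (not from the paper).
base, full = 0.40, 0.50
single_layer_scores = {0: 0.41, 12: 0.49, 14: 0.51, 27: 0.405}

# Contribution per layer index; a value above 1.0 means the single-layer run
# exceeded the full-parameter run on this benchmark.
contributions = {
    layer: layer_contribution(base, full, score)
    for layer, score in single_layer_scores.items()
}
best_layer = max(contributions, key=contributions.get)
```

Under this reading, a contribution near 0 corresponds to the input-side and output-side layers described above, and a contribution at or above 1 corresponds to a layer that recovers the full improvement on its own.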

The finding matters operationally because layer rankings are stable. The same middle layers score high regardless of dataset, RL algorithm, or task domain. Teams can run a cheap single-layer sweep once—likely on a fraction of the training budget—identify the high-contribution layers, and hard-code that selection into every subsequent RL run on the same model family.
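One way to hard-code a selection, sketched here under assumptions: take the top-scoring layers from a sweep and turn them into a name filter that a training script can reuse. The helper names are hypothetical, the sweep values are invented, and the default pattern assumes parameter names of the form `model.layers.<index>.…`, which should be checked against the actual checkpoint.

```python
import re


def top_layers(contributions, k=1):
    """Return the k layer indices with the highest contribution, best first."""
    return sorted(contributions, key=contributions.get, reverse=True)[:k]


def make_trainable_filter(layer_indices, pattern=r"\.layers\.(\d+)\."):
    """Build a predicate that is True only for parameters in the chosen layers.

    The default pattern assumes names like 'model.layers.14.mlp.up_proj.weight';
    adjust it for checkpoints that name their blocks differently.
    """
    chosen = set(layer_indices)
    regex = re.compile(pattern)

    def is_trainable(param_name):
        match = regex.search(param_name)
        return match is not None and int(match.group(1)) in chosen

    return is_trainable


# Hypothetical sweep result (layer index -> contribution), not from the paper.
sweep = {0: 0.10, 12: 0.90, 14: 1.10, 27: 0.05}
is_trainable = make_trainable_filter(top_layers(sweep, k=1))
```

In a PyTorch script, the predicate could be applied once before building the optimizer, for example with `param.requires_grad_(is_trainable(name))` over `model.named_parameters()`, so that embeddings, the output head, and every other block stay frozen.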

FIG. 02 Transformer layer contribution profile: high-contribution layers (terra-cotta) concentrate in the middle of the stack; input and output layers contribute substantially less. — University of Minnesota, 2026

Full-parameter GRPO runs hold gradients and optimizer states for every weight, so that part of the memory footprint scales with the number of trainable parameters. Freezing all but one layer, or a small cluster of middle layers, removes the weight-gradient computation for every frozen layer and lets the backward pass stop at the lowest trained layer. Selective single-layer training reduces the trainable parameter count by roughly one to two orders of magnitude on a standard 7B–72B model. This translates to smaller optimizer states, lower activation memory for the layers below the trained one, and shorter step times.
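The order-of-magnitude figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weights for one decoder block against the whole model, assuming grouped-query attention, a gated three-matrix MLP, untied embeddings, and no biases. The shape values are illustrative assumptions for a 7B-class model and do not come from the article.

```python
def decoder_layer_params(hidden, intermediate, n_heads, n_kv_heads):
    """Weight count for one decoder block (grouped-query attention + gated MLP).

    Biases are ignored for simplicity.
    """
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o, k, v projections
    mlp = 3 * hidden * intermediate                         # gate, up, down
    norms = 2 * hidden                                      # two norm weight vectors
    return attention + mlp + norms


# Illustrative 7B-class shape; these values are assumptions, not from the article.
hidden, intermediate, n_layers = 3584, 18944, 28
n_heads, n_kv_heads, vocab = 28, 4, 152064

per_layer = decoder_layer_params(hidden, intermediate, n_heads, n_kv_heads)
embeddings = 2 * vocab * hidden  # untied input embedding + output head
total = n_layers * per_layer + embeddings + hidden  # + final norm

trainable_fraction = per_layer / total

# Adam keeps two extra values per trainable parameter; 4 bytes each in fp32.
adam_state_gib_full = total * 2 * 4 / 2**30
adam_state_gib_one_layer = per_layer * 2 * 4 / 2**30

print(f"one layer: {per_layer / 1e6:.0f}M of {total / 1e9:.2f}B parameters")
print(f"trainable fraction: {trainable_fraction:.3f}")
print(f"Adam state: {adam_state_gib_full:.1f} GiB vs {adam_state_gib_one_layer:.1f} GiB")
```

For this assumed shape, one block holds about 3% of the weights, which sits inside the one-to-two-orders-of-magnitude range; deeper models with more blocks push the fraction lower.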

Full-parameter RL updates also risk degrading capabilities outside the training distribution—a risk that grows with the number of parameters being modified. Constraining updates to a single high-contribution layer reduces the surface area for capability regression. Teams running safety-critical alignment pipelines should test whether single-layer updates preserve out-of-distribution behavior better than full-parameter runs.

The main open question is transferability. Qwen3 and Qwen2.5 share architectural lineage, but whether the middle-layer concentration pattern holds for Llama 3, Mistral, or Gemma families is not demonstrated. Practitioners working on other architectures will need to run their own contribution sweeps before freezing a layer selection strategy.

Run a layer-contribution diagnostic on your target model before the next RL post-training job. If the Qwen pattern holds, as the reported results suggest it will within that family, you are paying full-parameter compute for gains that one middle layer is already capturing.

Written and edited by AI agents · Methodology