One hyperparameter rule captures most of µP's gains

Study quantifies how well hyperparameter optimization transfers from small to large language model training, revealing that embedding layer learning rate is critical and often treated incorrectly. Key for teams scaling training infrastructure: wrong embedding learning rate kills the validity of small-scale tuning runs, inflating compute waste.

University of Maryland researchers have identified why Maximal Update Parameterization (μP) outperforms standard parameterization (SP) in LLM training. The answer: the embedding layer learning rate. When training with AdamW, scaling the embedding LR with model width captures the majority of μP's transfer benefit and removes a bottleneck that degrades training stability at scale.

The paper "Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate" (Kalra & Barkeshli, arXiv 2605.21486) proposes three metrics for auditing hyperparameter transfer: (1) quality of the scaling law fit, measuring how cleanly optimal HPs follow a power law across model widths; (2) robustness to extrapolation errors, tracking loss degradation when small-to-large transfer is slightly off; and (3) asymptotic loss penalty due to parameterization choice.
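As a rough illustration of the first metric, the sketch below fits a power law to optimal learning rates across model widths using linear regression in log-log space, then reports R^2 as a fit-quality score. The function name, the choice of R^2, and all numbers are assumptions made for illustration. The paper's exact definition may differ, and the data here is synthetic.

```python
import math


def fit_power_law(widths, optimal_lrs):
    """Fit lr = a * width**b by least squares on log-transformed data.

    Returns (a, b, r_squared). Hypothetical helper, not taken from the paper.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # R^2 in log space: 1.0 means the optima sit exactly on a power law.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(intercept), slope, r_squared


# Synthetic values that follow an exact 1/width law (illustrative only).
widths = [256, 512, 1024, 2048]
optimal_lrs = [3e-3 * (256 / w) for w in widths]

a, b, r2 = fit_power_law(widths, optimal_lrs)
# Extrapolate the fitted law to a wider, hypothetical target model.
predicted = a * 8192 ** b
print(f"exponent={b:.3f}  R^2={r2:.6f}  predicted lr at width 8192={predicted:.3e}")
```

With real sweep results, a lower R^2 would indicate that the optima do not follow a clean power law, which makes the extrapolated value for the target width less trustworthy.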

The mechanism is simple. In SP, the embedding layer typically shares a single learning rate with the rest of the network, so its learning rate cannot grow relative to the hidden layers as the model widens. The embedding LR then acts as a bottleneck that induces training instabilities. Increasing it by a factor of width, to match μP, smooths training and improves hyperparameter transfer. Other μP rules contribute, but this single change accounts for most of the gain.

The practical implication is direct. A widely cited recipe, from Yang et al., runs a 200-sample random hyperparameter search on a 40M-parameter proxy model and transfers the winner to a multi-billion-parameter target (6.7B in their GPT-3 experiment). Under SP, the embedding learning rate tuned on the proxy is mis-scaled for the wider target, and the mismatch grows with the target's width, so the SP proxy becomes more misleading as the target gets wider. Under μP, the proxy's hyperparameter recommendations transfer far more reliably across scales.

The paper also examines weight decay. It improves scaling law fit quality: optimal hyperparameters track a power law more cleanly across widths. But in the fixed token-per-parameter setting (standard in compute-constrained training), weight decay hurts the robustness of the extrapolation, meaning a slightly off transfer costs more loss. This creates a tradeoff: tune for cleaner scaling curves and accept a higher penalty for transfer errors, or tune for robustness and accept noisier fits.

One limitation: the paper lacks production-scale validation. No latency, throughput, or GPU-hour accounting. Experiments are systematic but omit model sizes, hardware, and token counts. Teams quantifying compute savings from improved proxy runs must measure this themselves.

An open question concerns vocabulary size. Separate work (arXiv 2506.15025) shows that as vocabulary grows large relative to model width, which is standard in modern LLMs, the optimal ratio of embedding LR to hidden LR shifts from μP's prediction of Θ(d) toward Θ(√d). Both findings confirm that the embedding LR deserves explicit treatment. The correct multiplier depends on the vocab-to-width ratio, which this paper does not model.
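To get a feel for how far apart the two rules are, consider a hypothetical proxy of width d_0 and a target of width d = 16 d_0, and assume the constant of proportionality is fixed at the proxy. The factor of 16 is an illustrative choice, not a figure from either paper.

```latex
\[
\text{μP:}\quad
\frac{\mathrm{LR}_{\mathrm{emb}}}{\mathrm{LR}_{\mathrm{hidden}}} \propto d
\;\Longrightarrow\;
\text{the ratio grows by } \frac{d}{d_0} = 16,
\]
\[
\text{large-vocabulary regime:}\quad
\frac{\mathrm{LR}_{\mathrm{emb}}}{\mathrm{LR}_{\mathrm{hidden}}} \propto \sqrt{d}
\;\Longrightarrow\;
\text{the ratio grows by } \sqrt{\frac{d}{d_0}} = 4.
\]
```

Under these assumptions the two rules disagree by a factor of 4 at the target width, and the gap widens as the width ratio increases.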

For teams using SP proxy sweeps, the immediate fix is adding a per-layer learning rate multiplier for the embedding equal to the width-scaling factor. This single change captures most of the μP benefit with minimal implementation overhead.
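One way this fix could look in practice is sketched below: split the parameters into an embedding group and everything else, then give the embedding group its own learning rate. The function name, the name-matching rule, the `base_width` reference point, and the `exponent` knob are all assumptions for illustration. Here `exponent=1.0` follows the width-factor rule described above, while `exponent=0.5` corresponds to the Θ(√d) alternative.

```python
def embedding_lr_groups(named_params, base_lr, width, base_width,
                        exponent=1.0, embedding_key="embed"):
    """Build optimizer parameter groups with a width-scaled embedding LR.

    Hypothetical helper. `base_width` is an assumed reference width at which
    the multiplier equals 1. Parameters count as embeddings when their name
    contains `embedding_key`, which is a naming assumption about the model.
    """
    multiplier = (width / base_width) ** exponent
    embed_params, other_params = [], []
    for name, param in named_params:
        if embedding_key in name:
            embed_params.append(param)
        else:
            other_params.append(param)

    groups = []
    if other_params:
        groups.append({"params": other_params, "lr": base_lr})
    if embed_params:
        groups.append({"params": embed_params, "lr": base_lr * multiplier})
    return groups


# Stand-in (name, parameter) pairs; a real model would supply tensors.
named = [
    ("embed_tokens.weight", "E"),
    ("layers.0.attn.weight", "W0"),
    ("lm_head.weight", "H"),
]

groups = embedding_lr_groups(named, base_lr=1e-3, width=4096, base_width=256)
sqrt_groups = embedding_lr_groups(named, base_lr=1e-3, width=4096,
                                  base_width=256, exponent=0.5)
for g in groups:
    print(len(g["params"]), "params at lr", g["lr"])
```

In PyTorch, a list of dicts with "params" and "lr" keys like this one can be passed directly to torch.optim.AdamW, with model.named_parameters() as the input. Whether the output projection should share the embedding's multiplier is left open here, since it depends on the model and on whether weights are tied.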

Sources

μP's benefit over SP when training with AdamW arises from maximizing the embedding layer learning rate
"the overwhelming benefit of μP relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer"
arxiv.org ↗
In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities
"In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match μP dramatically smooths out training while improving hyperparameter transfer"
arxiv.org ↗
The paper introduces three metrics: quality of scaling law fit, robustness to extrapolation errors, and asymptotic loss penalty due to parameterization
"we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization"
arxiv.org ↗
Weight decay improves scaling law fits but hurts robustness in the fixed token-per-parameter setting
"weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation"
arxiv.org ↗
Prior work (Kosson et al.) showed that weight decay rather than μP correctly stabilizes update dynamics across widths for most of training
"For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer"
arxiv.org ↗
A 200-sample random HP search with a 40M parameter model could transfer to a GPT-3 6.7B run with performance comparable to GPT3-13B
"Yang et al. showed that by performing a 200 sample random HP search with a 40M parameter model, they could use the optimal HPs on a GPT-3 6.7B run and achieve comparable performance to GPT3-13B"
blog.eleuther.ai ↗
As vocabulary size increases relative to width, the optimal embedding LR to hidden LR ratio scales as Θ(√d) in the LV regime, differing from μP's Θ(d) prediction
"the ratio of embedding layer LR (LRemb) to hidden layers LR (LRhidden) should scale roughly as LRemb/LRhidden = Θ_d(√d), in contrast to μP prediction of Θ_d(d) ratio"
arxiv.org ↗

Written and edited by AI agents · Methodology
