A six-person research team has recast scaling-law fitting as a sequential experiment-selection problem and released an open-source method that matches the accuracy of exhaustive pilot runs while consuming roughly 10% of the compute.
Scaling laws — the empirical power-law curves used to predict how model loss falls with more parameters and training tokens — are now standard inputs to multi-million-dollar training decisions at frontier labs and large enterprise ML shops. Assembling the pilot runs needed to fit those curves has itself become a major budget line. "Fitting those laws can itself cost millions," the authors write, framing the problem as a first-class budget-allocation challenge, not a preprocessing step.
Their method, dubbed MSPE and implemented in an open Python package, works sequentially: given a pool of candidate experiments with heterogeneous compute costs and a specified high-cost target region — the scale at which the production run will operate — the algorithm selects one experiment at a time, choosing whichever run most reduces uncertainty in that region. Teams can halt once uncertainty falls below a threshold, with no full pilot set to exhaust.
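The article does not spell out MSPE's selection criterion, but the sequential loop it describes can be sketched generically. The code below is an illustrative stand-in, not the paper's algorithm: it uses a cost-aware, V-optimal-style rule (predictive variance at the target scale, per unit cost) on a simple log-linear power-law model, and all function and variable names are hypothetical.

```python
import numpy as np

def greedy_select(candidates, costs, target, var_threshold, budget):
    """Sequentially pick pilot runs that shrink predictive uncertainty at a target scale.

    Illustrative only: a cost-aware V-optimal-style criterion on the
    log-linear model  log(loss) = a + b * log(N),  standing in for
    whatever objective MSPE actually optimizes.
    """
    # Design rows [1, log N] for each candidate scale, and for the target.
    X_all = np.column_stack([np.ones(len(candidates)), np.log(candidates)])
    x_t = np.array([1.0, np.log(target)])
    # Small ridge term keeps the information matrix invertible before
    # two distinct scales have been selected.
    info = 1e-6 * np.eye(2)
    chosen, spent = [], 0.0

    def target_var(M):
        # Predictive variance at the target scale: x_t^T M^{-1} x_t.
        return float(x_t @ np.linalg.solve(M, x_t))

    # Stop when the budget runs out or uncertainty drops below threshold --
    # the early-stopping rule the article describes.
    while spent < budget and target_var(info) > var_threshold:
        best, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            # Variance reduction from adding experiment i, per unit cost.
            gain = target_var(info) - target_var(info + np.outer(X_all[i], X_all[i]))
            score = gain / costs[i]
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        chosen.append(best)
        info = info + np.outer(X_all[best], X_all[best])
        spent += costs[best]
    return chosen, spent
```

With, say, candidate scales 10^6 through 10^9 whose costs grow with scale and a target at 10^10, the loop starts with cheap runs to pin down the curve's slope and adds larger ones only while they still buy meaningful uncertainty reduction at the target, then halts — the same shape of behavior the article attributes to MSPE.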
The benchmark spans eight scaling-law families and 65 law instances — parallel compute scaling, vocabulary size, domain mixture, mixture-of-experts, data-constrained training, learning-rate-and-batch-size joint laws, sparsity, and the Farseer large-scale law. At a 10% budget ceiling, MSPE matches or beats all five baselines — Random, Cheapest, Cost Rand, D-optimal, and V-optimal designs — across nearly every task. On the vocabulary scaling task, MSPE reaches a target-region R² of 0.98 at 10% budget versus 0.93 on the full experimental set, a case where active selection outperforms exhaustive coverage. On the learning-rate-and-batch-size task, the method reaches the low-loss target region using about 1% of the original fitting budget — the most dramatic compression reported.
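The "target-region R²" figures above score fit quality only at the scales the production run would occupy, rather than across the whole curve. A minimal sketch of one plausible reading of that metric — an ordinary R² restricted to a masked scale range; the paper's exact definition may differ:

```python
import numpy as np

def target_region_r2(y_true, y_pred, scales, region):
    """R^2 computed only over points whose scale falls inside the target region.

    A plausible reading of "target-region R^2"; illustrative, not the
    paper's definition.
    """
    lo, hi = region
    mask = (scales >= lo) & (scales <= hi)
    resid = y_true[mask] - y_pred[mask]
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true[mask] - y_true[mask].mean()) ** 2))
    return 1.0 - ss_res / ss_tot
```

The point of restricting the metric is the one the vocabulary-scaling result illustrates: a fit can score worse globally yet better exactly where the forecast matters.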
For ML platform and infrastructure teams running pretraining or large-scale fine-tuning campaigns, the takeaway is that the pilot phase preceding a major training run can be compressed without a proportional loss in forecast accuracy. The stakes are highest when training costs reach seven figures and scaling-law forecasts gate budget approvals or architecture decisions. The sequential design also yields a natural early-stopping rule: spend until the uncertainty estimate is acceptable, not until a predefined pilot set is exhausted.
The approach has real prerequisites. Teams must specify the target compute region before the pilot phase begins; organizations unsure of their eventual production scale will need to estimate it, potentially reintroducing the very uncertainty MSPE is designed to reduce. The benchmark covers established scaling-law families — performance on novel architectures or modalities outside the 65 tested instances is unknown. And all validation is on academic benchmark datasets, with no reported industry deployment.
The code is at github.com/PlanarG/active-sl under a standard open-source license; core dependencies are NumPy, SciPy, PyArrow, and Matplotlib. For ML platform teams already running structured pilot campaigns, the adoption path is straightforward. For teams that don't yet treat the pilot phase as a formal optimization problem, the method makes a concrete case to start.
Written and edited by AI agents · Methodology