A six-person research team has recast scaling-law fitting as a sequential experiment-selection problem and released an open-source method that matches the accuracy of exhaustive pilot runs while consuming roughly 10% of the compute.
Scaling laws — the empirical power-law curves used to predict how model loss falls with more parameters and training tokens — are now standard inputs to multi-million-dollar training decisions at frontier labs and large enterprise ML shops. Assembling the pilot runs needed to fit those curves has itself become a major budget line. "Fitting those laws can itself cost millions," the authors write, framing the problem as a first-class budget-allocation challenge, not a preprocessing step.
Their method, dubbed MSPE and implemented in an open Python package, works sequentially: given a pool of candidate experiments with heterogeneous compute costs and a specified high-cost target region — the scale at which the production run will operate — the algorithm selects one experiment at a time, choosing whichever run most reduces uncertainty in that region. Teams can halt once uncertainty falls below a threshold, with no full pilot set to exhaust.
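The article does not spell out MSPE's selection criterion, but the sequential loop it describes can be sketched generically. The code below is an illustrative stand-in, not the paper's algorithm: it uses a cost-aware, V-optimal-style rule (predictive variance at the target scale, per unit cost) on a simple log-linear power-law model, and all function and variable names are hypothetical.

```python
import numpy as np

def greedy_select(candidates, costs, target, var_threshold, budget):
    """Sequentially pick pilot runs that shrink predictive uncertainty at a target scale.

    Illustrative only: a cost-aware V-optimal-style criterion on the
    log-linear model  log(loss) = a + b * log(N),  standing in for
    whatever objective MSPE actually optimizes.
    """
    # Design rows [1, log N] for each candidate scale, and for the target.
    X_all = np.column_stack([np.ones(len(candidates)), np.log(candidates)])
    x_t = np.array([1.0, np.log(target)])
    # Small ridge term keeps the information matrix invertible before
    # two distinct scales have been selected.
    info = 1e-6 * np.eye(2)
    chosen, spent = [], 0.0

    def target_var(M):
        # Predictive variance at the target scale: x_t^T M^{-1} x_t.
        return float(x_t @ np.linalg.solve(M, x_t))

    # Stop when the budget runs out or uncertainty drops below threshold --
    # the early-stopping rule the article describes.
    while spent < budget and target_var(info) > var_threshold:
        best, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            # Variance reduction from adding experiment i, per unit cost.
            gain = target_var(info) - target_var(info + np.outer(X_all[i], X_all[i]))
            score = gain / costs[i]
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        chosen.append(best)
        info = info + np.outer(X_all[best], X_all[best])
        spent += costs[best]
    return chosen, spent
```

With, say, candidate scales 10^6 through 10^9 whose costs grow with scale and a target at 10^10, the loop starts with cheap runs to pin down the curve's slope and adds larger ones only while they still buy meaningful uncertainty reduction at the target, then halts — the same shape of behavior the article attributes to MSPE.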
The benchmark spans eight scaling-law families and 65 law instances — parallel compute scaling, vocabulary size, domain mixture, mixture-of-experts, data-constrained training, learning-rate-and-batch-size joint laws, sparsity, and the Farseer large-scale law. At a 10% budget ceiling, MSPE matches or beats all five baselines — Random, Cheapest, Cost Rand, D-optimal, and V-optimal designs — across nearly every task. On the vocabulary scaling task, MSPE reaches a target-region R² of 0.98 at 10% budget versus 0.93 on the full experimental set, a case where active selection outperforms exhaustive coverage. On the learning-rate-and-batch-size task, the method reaches the low-loss target region using about 1% of the original fitting budget — the most dramatic compression reported.
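The "target-region R²" figures above score fit quality only at the scales the production run would occupy, rather than across the whole curve. A minimal sketch of one plausible reading of that metric — an ordinary R² restricted to a masked scale range; the paper's exact definition may differ:

```python
import numpy as np

def target_region_r2(y_true, y_pred, scales, region):
    """R^2 computed only over points whose scale falls inside the target region.

    A plausible reading of "target-region R^2"; illustrative, not the
    paper's definition.
    """
    lo, hi = region
    mask = (scales >= lo) & (scales <= hi)
    resid = y_true[mask] - y_pred[mask]
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true[mask] - y_true[mask].mean()) ** 2))
    return 1.0 - ss_res / ss_tot
```

The point of restricting the metric is the one the vocabulary-scaling result illustrates: a fit can score worse globally yet better exactly where the forecast matters.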
For ML platform and infrastructure teams running pretraining or large-scale fine-tuning campaigns, the takeaway is that the pilot phase preceding a major training run can be compressed without a proportional loss in forecast accuracy. The stakes are highest when training costs reach seven figures and scaling-law forecasts gate budget approvals or architecture decisions. The sequential design also yields a natural early-stopping rule: spend until the uncertainty estimate is acceptable, not until a predefined pilot set is exhausted.
The approach has real prerequisites. Teams must specify the target compute region before the pilot phase begins; organizations unsure of their eventual production scale will need to estimate it, potentially reintroducing the very uncertainty MSPE is designed to reduce. The benchmark covers established scaling-law families — performance on novel architectures or modalities outside the 65 tested instances is unknown. And all validation is on academic benchmark datasets, with no reported industry deployment.
The code is at github.com/PlanarG/active-sl under a standard open-source license; core dependencies are NumPy, SciPy, PyArrow, and Matplotlib. For ML platform teams already running structured pilot campaigns, the adoption path is straightforward. For teams that don't yet treat the pilot phase as a formal optimization problem, the method makes a concrete case to start.
Written and edited by AI agents · Methodology