Four researchers have settled an open theoretical question in fair ML: exactly how much data is required to certify multicalibration, the property that a model's predictions remain calibrated on every demographic subgroup simultaneously. The answer, Θ̃(ε⁻³) samples up to polylogarithmic factors, gives compliance and ML engineering teams a concrete data budget for the first time.
The paper, "The Sample Complexity of Multicalibration," is authored by Natalie Collina, Jiuyao Lu, Georgy Noarov, and Aaron Roth (arXiv, April 2026). Working in the standard batch learning setting — a learner observes n i.i.d. samples from an unknown distribution, then outputs a predictor — they prove matching upper and lower bounds on multicalibration error measured by Expected Calibration Error (ECE). Specifically, for any fixed κ > 0, in the regime where the number of demographic groups |G| ≤ ε⁻κ, the minimax sample complexity is Θ̃(ε⁻³): that many samples are both necessary and sufficient to drive population multicalibration error below ε.
The lower bound applies even to randomized predictors, ruling out any probabilistic trick to reduce data requirements. The matching upper bound is achieved by a randomized predictor derived through an online-to-batch reduction, converting existing online learning algorithms into a sample-efficient batch procedure.
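The reduction itself follows a standard template. The sketch below is a generic online-to-batch conversion in Python, not the paper's exact construction, and the online_learner interface (current_predictor, update) is hypothetical:

```python
import random

def online_to_batch(online_learner, samples):
    """Generic online-to-batch conversion (standard template; not the
    paper's exact construction). Run the online algorithm over the
    i.i.d. sample, snapshot the predictor it holds at each round, and
    return the randomized predictor that applies a uniformly random
    snapshot to each input."""
    snapshots = []
    for x, y in samples:
        # Record the predictor the learner holds before seeing (x, y) ...
        snapshots.append(online_learner.current_predictor())
        # ... then let the online algorithm update on the new example.
        online_learner.update(x, y)

    def randomized_predictor(x):
        # The batch output is a uniform mixture of the online iterates,
        # which is why the upper bound is attained by a randomized
        # rather than deterministic predictor.
        return random.choice(snapshots)(x)

    return randomized_predictor
```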
The sharpest operational result is the gap between multicalibration and marginal calibration. Marginal calibration, which asks only that predictions be calibrated over the population as a whole, regardless of subgroup membership, requires only Θ̃(ε⁻²) samples. Requiring simultaneous calibration across all protected subgroups costs an additional ε⁻¹ factor. For teams designing training pipelines around specific fairness error tolerances, this exponent difference translates directly into dataset-size decisions: hitting ε = 0.01 multicalibration error requires on the order of 10⁶ samples, not the 10⁴ that marginal calibration demands at the same tolerance.
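A quick back-of-the-envelope check makes the exponent gap concrete. The sketch below just evaluates the two rates; the constants and polylogarithmic factors hidden in the Θ̃ notation are ignored, so these numbers are order-of-magnitude floors rather than exact requirements:

```python
# Order-of-magnitude data floors implied by the Theta~(eps^-3)
# multicalibration rate versus the Theta~(eps^-2) marginal-calibration
# rate. Constants and polylog factors are suppressed.
for eps in (0.1, 0.05, 0.01):
    multi = eps ** -3      # multicalibration floor ~ eps^-3
    marginal = eps ** -2   # marginal calibration floor ~ eps^-2
    print(f"eps = {eps:<5} multicalibration ~ {multi:>12,.0f}   "
          f"marginal ~ {marginal:>10,.0f}")
# At eps = 0.01 the gap is 1,000,000 vs 10,000 samples: the 10^6 vs 10^4
# figures quoted above.
```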
The paper identifies a sharp threshold at the boundary case κ = 0, where the number of groups is bounded by a constant: there, the sample complexity collapses to Θ̃(ε⁻²), matching marginal calibration. This suggests that in low-group-count settings, the extra cost enterprises associate with group fairness audits may be overstated. The results generalize to a weighted Lp multicalibration metric for every 1 ≤ p ≤ 2, with optimal sample-complexity exponent 3/p, and extend to a broader class of elicitable properties, including expectiles and bounded-density quantiles, covering fairness metrics beyond mean calibration that regulators are beginning to reference.
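Schematically, and again as our paraphrase of the generalization rather than the paper's exact statement, the Lp family of rates reads

\[
n^{*}_{p}(\varepsilon) \;=\; \tilde{\Theta}\!\left(\varepsilon^{-3/p}\right), \qquad 1 \le p \le 2,
\]

which at p = 1 recovers the ε⁻³ rate for ECE above.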
A second finding concerns the batch and online settings. For marginal calibration, the online setting is strictly harder than batch. Multicalibration breaks that pattern: batch and online sample complexity are equivalent, up to polylogarithmic factors. An organization that has invested in online multicalibration infrastructure gains batch-setting guarantees at no additional sample cost — and vice versa.
For enterprises subject to EU AI Act high-risk system obligations or internal bias audit programs, the practical implications are direct. Until now, data procurement decisions for fairness certification were largely heuristic. The Θ̃(ε⁻³) bound converts a compliance aspiration into an engineering specification: define the acceptable ECE tolerance ε, count the protected subgroups in scope, and compute the training-set floor. Fairness auditors who have been citing multicalibration as the gold standard can now attach a specific dataset-size requirement to that citation.
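As a compliance-engineering sketch, the helper below screens a dataset size against that floor. The Θ̃ notation does not pin down the constant or the exact polylogarithmic dependence on the number of groups, so the log term here is a placeholder assumption; treat this as a rough screen, not a certification:

```python
import math

def meets_multicalibration_floor(n_samples: int, eps: float,
                                 n_groups: int) -> bool:
    """Rough screen of a dataset against the Theta~(eps^-3) floor.

    The polylog factor hidden in Theta~ (including its dependence on the
    number of protected subgroups) is not specified by the headline
    bound; log(n_groups / eps) below is a placeholder assumption, not
    the paper's expression.
    """
    floor = (eps ** -3) * math.log(max(n_groups, 2) / eps)
    return n_samples >= floor

# Example: is 10M samples enough for ECE tolerance 0.01 across 40
# protected subgroups? True under this placeholder polylog factor.
print(meets_multicalibration_floor(10_000_000, 0.01, 40))
```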
The bound is minimax — it characterizes the worst case over distributions — so real deployments with favorable structure may need less data. But in adversarial regulatory environments, worst-case guarantees are exactly what auditors want.