Research from Helmholtz-Zentrum Berlin and Friedrich Schiller University Jena, published May 13, quantifies a failure mode that scientific ML pipelines have been ignoring: two classifiers trained on independent bootstraps of the same dataset disagree on 8.0–21.8% of test predictions, even when their accuracy scores align within 1.3–4.2 percentage points. Gordan Prastalo and Kevin Maik Jablonka tested 9 chemistry benchmarks; the results bear directly on drug-screening and materials-discovery teams where model output drives experimental decisions.
The metric is cross-sample prediction churn — the fraction of test-set predictions that flip class when you retrain on a fresh draw from the same training population. In virtual-screening pipelines, Bayesian-optimisation loops, and active-learning systems, the model's class assignment determines whether a given molecule gets synthesized. An aggregate accuracy delta of 1.8 percentage points looks acceptable in benchmark tables. Sixteen percent of molecules silently flipping class on every retraining cycle is not — aggregate metrics hide it entirely.
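In code, churn is simply the disagreement rate of hard class predictions between two trained models on a fixed test set. A minimal sketch, with the function and argument names chosen here for illustration rather than taken from the paper:

```python
import numpy as np

def prediction_churn(preds_a, preds_b):
    """Cross-sample prediction churn: the fraction of test examples whose
    predicted class differs between two models trained on independent
    draws from the same training population."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Two models with near-identical accuracy can still churn heavily:
# prediction_churn(model_a.predict(X_test), model_b.predict(X_test))
```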
Three standard uncertainty-management techniques — deep ensembles, MC dropout, and stochastic weight averaging — fail here. All three sample over weights at fixed data, capturing parameter-side variance while missing data-sampling variation. Across the 9 benchmarks, these methods shift churn by −22.3% to +12.5% relative to empirical risk minimisation, with no consistent direction. Shipping a deep ensemble is indistinguishable from shipping plain ERM on churn.
Two data-side methods reduce churn. K-bootstrap bagging cuts churn 40–54% on every dataset at no accuracy cost; the compute overhead is K× that of ERM. Twin-bootstrap is the paper's primary contribution: two networks trained jointly on independent bootstraps, with a symmetric KL-divergence consistency loss enforcing agreement between their output distributions. At matched 2×-ERM compute, the same budget as bagging with K=2, twin-bootstrap delivers a median 45% churn reduction beyond that bagging baseline. On the BACE bioactivity benchmark, mean class-flip rate drops from 16.1% under ERM to 5.7% under twin-bootstrap.
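The core idea is easy to sketch. Below is a minimal PyTorch-style training step under assumptions of my own: the consistency weight `lambda_consistency`, the choice to enforce agreement on a shared batch of inputs, and the joint optimizer are illustrative placeholders, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(logits_a, logits_b):
    """Symmetric KL divergence between two predicted class distributions."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    kl_qp = (log_q.exp() * (log_q - log_p)).sum(dim=-1)
    return (kl_pq + kl_qp).mean()

def twin_bootstrap_step(net_a, net_b, batch_a, batch_b, shared_inputs,
                        optimizer, lambda_consistency=1.0):
    """One joint update: each network fits its own bootstrap sample, and a
    symmetric-KL term pushes their output distributions toward agreement.
    `optimizer` is assumed to hold the parameters of both networks."""
    xa, ya = batch_a   # minibatch drawn from bootstrap A
    xb, yb = batch_b   # minibatch drawn from bootstrap B

    loss_a = F.cross_entropy(net_a(xa), ya)
    loss_b = F.cross_entropy(net_b(xb), yb)
    consistency = symmetric_kl(net_a(shared_inputs), net_b(shared_inputs))

    loss = loss_a + loss_b + lambda_consistency * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```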
The 9 benchmarks span MoleculeNet, TDC ADME and Tox, and materials-science classification tasks — small-to-medium molecular datasets where data-sampling variance dominates. The authors argue churn deserves its own column in scientific-ML benchmark reports alongside accuracy, AUC, and calibration. Without it, parameter-side and data-side methods look identical on reported metrics.
No production deployment evidence is provided. The paper doesn't report latency or inference overhead for twin-bootstrap at serving time, doesn't extend results to regression, and doesn't examine whether churning predictions cluster near decision boundaries. That gap matters operationally: if unstable predictions concentrate in ambiguous cases, a confidence threshold may be cheaper than twin-bootstrap retraining.
Before wiring any scientific ML classifier into a closed-loop system, run a two-bootstrap churn audit — train two models on independent bootstraps and count label disagreements on your hold-out set. If that number exceeds 10%, your aggregate accuracy is concealing instability that will show up as divergence between retraining runs. Deep ensembles and MC dropout won't save you.
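A minimal version of that audit, sketched here with scikit-learn conventions; the estimator, the NumPy-array inputs, and the 10% cutoff are placeholders reflecting the rule of thumb above, not a prescribed protocol.

```python
import numpy as np
from sklearn.base import clone

def two_bootstrap_churn_audit(estimator, X_train, y_train, X_holdout, seed=0):
    """Train two copies of the same model on independent bootstrap resamples
    of the training set and report how often their hold-out labels disagree.
    X_train, y_train, X_holdout are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(2):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        model = clone(estimator).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_holdout))
    return float(np.mean(preds[0] != preds[1]))  # fraction of flipped labels

# Illustrative usage:
# churn = two_bootstrap_churn_audit(RandomForestClassifier(), X_tr, y_tr, X_val)
# if churn > 0.10:
#     print(f"Churn {churn:.1%}: accuracy is hiding retraining instability.")
```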