Research from Helmholtz-Zentrum Berlin and Friedrich Schiller University Jena, published May 13, quantifies a failure mode that scientific ML pipelines have been ignoring: two classifiers trained on independent bootstraps of the same dataset disagree on 8.0–21.8% of test predictions, even when their accuracy scores align within 1.3–4.2 percentage points. Gordan Prastalo and Kevin Maik Jablonka tested 9 chemistry benchmarks; the results bear directly on drug-screening and materials-discovery teams where model output drives experimental decisions.
The metric is cross-sample prediction churn — the fraction of test-set predictions that flip class when you retrain on a fresh draw from the same training population. In virtual-screening pipelines, Bayesian-optimisation loops, and active-learning systems, the model's class assignment determines whether a given molecule gets synthesized. An aggregate accuracy delta of 1.8 percentage points looks acceptable in benchmark tables. Sixteen percent of molecules silently flipping class on every retraining cycle is not — aggregate metrics hide it entirely.
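In code, churn is simply the disagreement rate of hard class predictions between two trained models on a fixed test set. A minimal sketch, with the function and argument names chosen here for illustration rather than taken from the paper:

```python
import numpy as np

def prediction_churn(preds_a, preds_b):
    """Cross-sample prediction churn: the fraction of test examples whose
    predicted class differs between two models trained on independent
    draws from the same training population."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Two models with near-identical accuracy can still churn heavily:
# prediction_churn(model_a.predict(X_test), model_b.predict(X_test))
```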
Three standard uncertainty-management techniques — deep ensembles, MC dropout, and stochastic weight averaging — fail here. All three sample over weights at fixed data, capturing parameter-side variance while missing data-sampling variation. Across the 9 benchmarks, these methods shift churn by −22.3% to +12.5% relative to empirical risk minimisation, with no consistent direction. Shipping a deep ensemble is indistinguishable from shipping plain ERM on churn.
Two data-side methods reduce churn. K-bootstrap bagging cuts churn 40–54% on every dataset at no accuracy cost; the compute overhead is K× that of ERM. Twin-bootstrap is the paper's primary contribution: two networks trained jointly on independent bootstraps, with a symmetric KL-divergence consistency loss enforcing agreement between their output distributions. At matched 2×-ERM compute, the same budget as bagging with K=2, twin-bootstrap delivers a median 45% churn reduction beyond that bagging baseline. On the BACE bioactivity benchmark, mean class-flip rate drops from 16.1% under ERM to 5.7% under twin-bootstrap.
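The core idea is easy to sketch. Below is a minimal PyTorch-style training step under assumptions of my own: the consistency weight `lambda_consistency`, the choice to enforce agreement on a shared batch of inputs, and the joint optimizer are illustrative placeholders, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(logits_a, logits_b):
    """Symmetric KL divergence between two predicted class distributions."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    kl_qp = (log_q.exp() * (log_q - log_p)).sum(dim=-1)
    return (kl_pq + kl_qp).mean()

def twin_bootstrap_step(net_a, net_b, batch_a, batch_b, shared_inputs,
                        optimizer, lambda_consistency=1.0):
    """One joint update: each network fits its own bootstrap sample, and a
    symmetric-KL term pushes their output distributions toward agreement.
    `optimizer` is assumed to hold the parameters of both networks."""
    xa, ya = batch_a   # minibatch drawn from bootstrap A
    xb, yb = batch_b   # minibatch drawn from bootstrap B

    loss_a = F.cross_entropy(net_a(xa), ya)
    loss_b = F.cross_entropy(net_b(xb), yb)
    consistency = symmetric_kl(net_a(shared_inputs), net_b(shared_inputs))

    loss = loss_a + loss_b + lambda_consistency * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```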
The 9 benchmarks span MoleculeNet, TDC ADME and Tox, and materials-science classification tasks — small-to-medium molecular datasets where data-sampling variance dominates. The authors argue churn deserves its own column in scientific-ML benchmark reports alongside accuracy, AUC, and calibration. Without it, parameter-side and data-side methods look identical on reported metrics.
No production deployment evidence is provided. The paper doesn't report latency or inference overhead for twin-bootstrap at serving time, doesn't extend results to regression, and doesn't examine whether churning predictions cluster near decision boundaries. That gap matters operationally: if unstable predictions concentrate in ambiguous cases, a confidence threshold may be cheaper than twin-bootstrap retraining.
Before wiring any scientific ML classifier into a closed-loop system, run a two-bootstrap churn audit — train two models on independent bootstraps and count label disagreements on your hold-out set. If that number exceeds 10%, your aggregate accuracy is concealing instability that will show up as divergence between retraining runs. Deep ensembles and MC dropout won't save you.
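A minimal version of that audit, sketched here with scikit-learn conventions; the estimator, the NumPy-array inputs, and the 10% cutoff are placeholders reflecting the rule of thumb above, not a prescribed protocol.

```python
import numpy as np
from sklearn.base import clone

def two_bootstrap_churn_audit(estimator, X_train, y_train, X_holdout, seed=0):
    """Train two copies of the same model on independent bootstrap resamples
    of the training set and report how often their hold-out labels disagree.
    X_train, y_train, X_holdout are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(2):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        model = clone(estimator).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_holdout))
    return float(np.mean(preds[0] != preds[1]))  # fraction of flipped labels

# Illustrative usage:
# churn = two_bootstrap_churn_audit(RandomForestClassifier(), X_tr, y_tr, X_val)
# if churn > 0.10:
#     print(f"Churn {churn:.1%}: accuracy is hiding retraining instability.")
```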