Research from Helmholtz-Zentrum Berlin and Friedrich Schiller University Jena, published May 13, quantifies a failure mode scientific ML pipelines have been ignoring: two classifiers trained on independent bootstraps of the same dataset disagree on 8.0–21.8% of test predictions, even when their accuracy scores align within 1.3–4.2 percentage points. Gordan Prastalo and Kevin Maik Jablonka tested nine chemistry and materials benchmarks; the result carries direct implications for drug-screening and materials-discovery teams, where model output drives experimental decisions.

The metric is cross-sample prediction churn — the fraction of test-set predictions that flip class when you retrain on a fresh draw from the same training population. In virtual-screening pipelines, Bayesian-optimisation loops, and active-learning systems, the model's class assignment determines whether a given molecule gets synthesized. An aggregate accuracy delta of 1.8 percentage points looks acceptable in benchmark tables. Sixteen percent of molecules silently flipping class on every retraining cycle is not — aggregate metrics hide it entirely.
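The metric itself is a one-liner. A minimal sketch, assuming two already-trained classifiers and their hard-label predictions on a shared test set; the `prediction_churn` name and the synthetic demo are illustrative, not from the paper:

```python
import numpy as np

def prediction_churn(preds_a, preds_b):
    """Fraction of test points whose predicted class differs between
    two models trained on independent bootstraps of the same data."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Synthetic demo: a 16% flip rate, the kind of churn BACE shows under
# ERM, is easy to miss when both models report similar accuracy.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
b = np.where(rng.random(1000) < 0.16, 1 - a, a)
print(f"churn = {prediction_churn(a, b):.1%}")
```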

Three standard uncertainty-management techniques (deep ensembles, MC dropout, and stochastic weight averaging) fail here. All three sample over weights at fixed data, capturing parameter-side variance while missing data-sampling variation. Across the nine benchmarks, these methods shift churn by −22.3% to +12.5% relative to empirical risk minimisation, with no consistent direction. On churn, shipping a deep ensemble is indistinguishable from shipping plain ERM.

FIG. 02 Three uncertainty techniques shift prediction churn in opposite directions, with no consistent improvement. (Helmholtz-Zentrum Berlin et al., arXiv:2605.13826)
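The distinction between the two sampling axes is easy to see in code. A minimal sketch, using a random-forest stand-in rather than the paper's neural networks; the function names are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier  # stand-in learner
from sklearn.utils import resample

def weight_side_ensemble(X, y, n_members=5):
    """Deep-ensemble style: every member sees the SAME training set;
    variance comes only from the random seed (parameter-side)."""
    return [RandomForestClassifier(random_state=s).fit(X, y)
            for s in range(n_members)]

def data_side_ensemble(X, y, n_members=5):
    """Bagging style: every member sees its own bootstrap draw, so the
    ensemble also averages over data-sampling variation (data-side)."""
    members = []
    for s in range(n_members):
        Xb, yb = resample(X, y, replace=True, random_state=s)
        members.append(RandomForestClassifier(random_state=s).fit(Xb, yb))
    return members
```

Deep ensembles, MC dropout, and SWA all live on the first axis; the churn the paper measures comes from the second.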

Two data-side methods reduce churn. K-bootstrap bagging cuts churn 40–54% on every dataset at no accuracy cost; the compute overhead is K× that of ERM. Twin-bootstrap is the paper's primary contribution: two networks trained jointly on independent bootstraps, with a symmetric KL-divergence consistency loss enforcing agreement between their output distributions. At matched compute (2× ERM, the same budget as bagging with K=2), twin-bootstrap delivers a median 45% churn reduction beyond bagging at K=2. On the BACE bioactivity benchmark, the mean class-flip rate drops from 16.1% under ERM to 5.7% under twin-bootstrap.

FIG. 03 K-bootstrap bagging consistently cuts churn 40–54% across chemistry and materials benchmarks with no accuracy penalty. (Helmholtz-Zentrum Berlin et al., arXiv:2605.13826)
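The paper's architecture, datasets, and hyperparameters aren't reproduced here; what follows is a minimal PyTorch sketch of the twin-bootstrap objective as described, with the consistency weight `lam` and the choice of a shared batch for the consistency term as assumptions:

```python
import torch.nn.functional as F

def symmetric_kl(logits_p, logits_q):
    """Symmetric KL, KL(p||q) + KL(q||p), between two networks'
    predictive distributions, computed from raw logits."""
    log_p = F.log_softmax(logits_p, dim=-1)
    log_q = F.log_softmax(logits_q, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return kl_pq + kl_qp

def twin_bootstrap_loss(net_a, net_b, batch_a, batch_b, x_shared, lam=1.0):
    """Joint objective: each network fits its own bootstrap batch via
    cross-entropy, while a symmetric-KL term on a shared batch of inputs
    pulls the two output distributions together. `lam` and the
    shared-batch choice are assumptions; the paper's settings differ."""
    xa, ya = batch_a
    xb, yb = batch_b
    ce = F.cross_entropy(net_a(xa), ya) + F.cross_entropy(net_b(xb), yb)
    return ce + lam * symmetric_kl(net_a(x_shared), net_b(x_shared))
```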

The nine benchmarks span MoleculeNet, TDC ADME and Tox, and materials-science classification tasks: small-to-medium molecular datasets where data-sampling variance dominates. The authors argue churn deserves its own column in scientific-ML benchmark reports alongside accuracy, AUC, and calibration. Without it, parameter-side and data-side methods look identical on reported metrics.

No production deployment evidence is provided. The paper doesn't report latency or inference overhead for twin-bootstrap at serving time, doesn't extend results to regression, and doesn't examine whether churning predictions cluster near decision boundaries. That gap matters operationally: if unstable predictions concentrate in ambiguous cases, a confidence threshold may be cheaper than twin-bootstrap retraining.

Before wiring any scientific ML classifier into a closed-loop system, run a two-bootstrap churn audit: train two models on independent bootstraps and count label disagreements on your hold-out set, as in the sketch below. If that number exceeds 10%, your aggregate accuracy is concealing instability that will surface as divergence between retraining runs. Deep ensembles and MC dropout won't save you.
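A minimal sketch of that audit, with a random-forest stand-in for your own model and a hypothetical `churn_audit` helper (swap in your own learner and featurisation); the 10% threshold follows the rule of thumb above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in for your model
from sklearn.utils import resample

def churn_audit(X_train, y_train, X_holdout, seed=0, threshold=0.10):
    """Train two models on independent bootstraps of the training set and
    report the fraction of hold-out predictions on which they disagree."""
    preds = []
    for s in (seed, seed + 1):
        Xb, yb = resample(X_train, y_train, replace=True, random_state=s)
        preds.append(RandomForestClassifier(random_state=s)
                     .fit(Xb, yb).predict(X_holdout))
    churn = float(np.mean(preds[0] != preds[1]))
    status = "unstable" if churn > threshold else "ok"
    print(f"two-bootstrap churn = {churn:.1%} ({status})")
    return churn
```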
