Production diffusion pipelines for accelerated MRI, CT reconstruction, and compressed sensing can generate plausible but incorrect images, and until now there has been no principled way to predict when. Burns and Fridovich-Keil's new analysis, built on a finite-sample lens, identifies four downstream consequences of inaccurate posterior spread in diffusion posterior samplers. It shows that breakdowns can occur even with a linear forward measurement model and a unimodal posterior, provided the learned prior is multimodal. This challenges the belief that simple linear inverse problems are inherently safe.

The core issue lies in the likelihood approximation that makes inference tractable. State-of-the-art posterior samplers, including DPS, πGDM, and DDRM, approximate the intractable intermediate likelihood p(y|x_t) through a point estimate of the posterior mean x̂_0(x_t) at each timestep. Hamidi and Yang note that such approximations are often based on the mean of the conditional densities of the reverse process, which can be obtained using Tweedie's formula. The true conditional density p(x_0|x_t) is thus replaced with a point estimate, which enters the reverse diffusion process through a guidance term. The arXiv paper notes this heuristic is computationally necessary for realistic timestep budgets, but its downstream effect on the sampled posterior has been unclear. Every production pipeline using these methods therefore incurs an approximation error that has not been quantified.
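For readers who want the approximation written out, the block below states it in one common notation. The variance-exploding noising convention x_t = x_0 + σ_t ε and the symbols σ_t and p_t are assumptions made here for illustration; the paper and the individual samplers may use a different parameterization, and each sampler adds its own step-size or variance heuristics on top.

```latex
% Assumed forward noising (variance-exploding convention):
%   x_t = x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Tweedie's formula: posterior mean from the score of the noised marginal p_t
\hat{x}_0(x_t) \;=\; \mathbb{E}[x_0 \mid x_t]
  \;=\; x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t)

% Exact intermediate likelihood (intractable) and its point-estimate surrogate
p(y \mid x_t) \;=\; \int p(y \mid x_0)\, p(x_0 \mid x_t)\, \mathrm{d}x_0
  \;\approx\; p\bigl(y \mid \hat{x}_0(x_t)\bigr)

% Guidance: the conditional score splits into prior score plus likelihood score
\nabla_{x_t} \log p_t(x_t \mid y)
  \;=\; \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p(y \mid x_t)
```

The surrogate is exact only when p(x_0|x_t) collapses to a single point; any remaining spread in p(x_0|x_t) is what the point estimate discards.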

The finite-sample lens treats the training set as a finite sample rather than an infinite population distribution; the resulting posterior approximates the population posterior to arbitrary precision as the training data grows. This diagnostic is agnostic to both the forward model (linear or nonlinear) and the specific likelihood approximation within the sampler, so it can be integrated into existing pipelines without replacing the core model or serving layer. The authors use it to trace how errors at intermediate timesteps propagate to the final reconstruction, distinguishing approximation artifacts from measurement noise or model bias.
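One way to see why a finite training set makes the analysis tractable: if the prior is taken to be the empirical distribution of the training points, the posterior mean E[x_0 | x_t] has a closed form as a softmax-weighted average of those points. The sketch below illustrates that idea only. It is not the authors' code; the function name `empirical_posterior_mean`, the variance-exploding convention x_t = x_0 + sigma_t * noise, and the toy data are all assumptions.

```python
import numpy as np

def empirical_posterior_mean(x_t, train, sigma_t):
    """E[x_0 | x_t] under a uniform empirical prior over `train`.

    Assumes x_t = x_0 + sigma_t * noise with standard Gaussian noise
    (an illustrative convention, not taken from the paper).
    Returns the posterior mean and the per-training-point weights.
    """
    train = np.asarray(train, dtype=float)        # shape (n, d)
    x_t = np.asarray(x_t, dtype=float)            # shape (d,)
    sq_dist = np.sum((train - x_t) ** 2, axis=1)  # shape (n,)
    logits = -sq_dist / (2.0 * sigma_t ** 2)
    logits -= logits.max()                        # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ train, weights

# Toy two-point "training set" with one mode at -1 and one at +1.
train = np.array([[-1.0], [1.0]])

# High noise, midway between the modes: the mean sits between them.
mean_mid, w_mid = empirical_posterior_mean([0.0], train, sigma_t=1.0)

# Low noise, on top of a mode: the mean snaps to that training point.
mean_near, w_near = empirical_posterior_mean([1.0], train, sigma_t=0.1)
```

At high noise the posterior mean lies between the modes, in a region the prior gives no mass to, which is the kind of intermediate-timestep behaviour the diagnostic is meant to expose.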

This propagation results in four downstream consequences that platform teams can now identify during incident response: sensitivity to early stopping time, inaccurate relative weighting of posterior modes, hallucination of unsupported prior modes, and hallucination of unsupported likelihood modes. The paper focuses on theoretical characterization rather than wall-clock benchmarks, so teams must benchmark runtime overhead internally against their existing inference stack. However, it offers a taxonomy to differentiate approximation error from other pipeline faults during validation.
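To make the "relative weighting of posterior modes" consequence concrete, a validation harness could compute reference mode masses under a finite-sample prior and compare them with the fraction of sampler outputs landing in each mode. The sketch below is a hypothetical illustration, not the paper's procedure: the helper names, the Gaussian noise model y = A x_0 + noise, and the hand-assigned mode labels are all assumptions.

```python
import numpy as np

def finite_sample_posterior_weights(y, A, train, sigma_y):
    """Posterior weight of each training point given y = A @ x_0 + noise.

    Assumes noise ~ N(0, sigma_y^2 I) and a uniform empirical prior
    over `train` (both are illustrative assumptions).
    """
    y = np.asarray(y, dtype=float)          # shape (m,)
    A = np.asarray(A, dtype=float)          # shape (m, d)
    train = np.asarray(train, dtype=float)  # shape (n, d)
    resid = y - train @ A.T                 # shape (n, m)
    logits = -np.sum(resid ** 2, axis=1) / (2.0 * sigma_y ** 2)
    logits -= logits.max()                  # stabilise the softmax
    w = np.exp(logits)
    return w / w.sum()

def mode_mass(weights, labels):
    """Sum posterior weight per mode label (labels are assumed given)."""
    labels = np.asarray(labels)
    return {int(k): float(weights[labels == k].sum()) for k in np.unique(labels)}

# Toy data: two training points per mode, linear scalar forward model.
train = np.array([[-1.0], [-0.9], [1.0], [1.1]])
labels = np.array([0, 0, 1, 1])
A = np.array([[1.0]])

w = finite_sample_posterior_weights(y=[0.5], A=A, train=train, sigma_y=0.5)
reference = mode_mass(w, labels)
```

A large gap between `reference` and the empirical mode fractions of a sampler's outputs would point toward the mode-weighting failure rather than measurement noise.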

FIG. 02 Error propagation through diffusion posterior samplers: likelihood approximation errors cascade into four failure modes, detected by the finite-sample lens diagnostic. — Burns & Fridovich-Keil, arxiv.org/abs/2605.30330

The most significant finding for practitioners is that these failures do not require a nonlinear measurement model or a multimodal posterior; a multimodal prior alone is sufficient. This undermines the assumption that linear inverse problems, which are common in accelerated MRI, CT, and deblurring, are safe for such posterior samplers. If the training distribution contains multiple modes, the approximation can over- or under-estimate posterior spread at intermediate timesteps, regardless of the forward operator's simplicity. In practice, this means a linear compressed-sensing pipeline, just like a nonlinear Fourier phase retrieval pipeline, can produce confident artifacts that resemble valid posterior samples but are not.
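A one-dimensional toy case shows how a multimodal prior alone can break the point-estimate likelihood, even with a linear forward model. The sketch assumes a two-atom prior at -1 and +1, a scalar forward model y = a * x_0 + noise, and the noising convention x_t = x_0 + sigma_t * noise. These choices and the function names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def likelihoods(x_t, y, sigma_t, sigma_y, atoms=(-1.0, 1.0), a=1.0):
    """Exact p(y | x_t) versus the point-estimate surrogate p(y | x0_hat).

    Prior: equal-weight atoms (assumed). Forward model: y = a * x_0 + noise.
    """
    atoms = np.asarray(atoms, dtype=float)
    # Posterior weights of each atom given x_t.
    logits = -(x_t - atoms) ** 2 / (2.0 * sigma_t ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Exact: average the measurement likelihood over p(x_0 | x_t).
    exact = float(np.sum(w * normal_pdf(y, a * atoms, sigma_y)))
    # Surrogate: plug the posterior mean into the measurement likelihood.
    x0_hat = float(w @ atoms)
    approx = float(normal_pdf(y, a * x0_hat, sigma_y))
    return exact, approx

# Intermediate timestep, midway between the modes, measurement near +1.
exact_mid, approx_mid = likelihoods(x_t=0.0, y=1.0, sigma_t=1.0, sigma_y=0.1)

# Late timestep, already on the +1 mode: the two agree.
exact_late, approx_late = likelihoods(x_t=1.0, y=1.0, sigma_t=0.05, sigma_y=0.1)
```

In this toy setting the measurement y = 1 with sigma_y = 0.1 leaves a posterior concentrated on a single atom, yet at the intermediate timestep the surrogate is many orders of magnitude smaller than the exact likelihood, because the posterior mean 0 sits between the modes where the measurement has essentially no support.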

The approximation is here to stay, because it is what makes DPS and its variants feasible for clinical or real-time serving at the required timestep budgets. Approximation-free alternatives, such as ensemble-based sequential Monte Carlo methods, can bound error using particles, but they carry a runtime penalty that conflicts with clinical latency requirements. The finite-sample diagnostic can flag risk before deployment, but it does not replace domain-specific eval harnesses and human-in-the-loop validation in medical imaging workflows.

Consider the finite-sample lens a mandatory drop-in stress-test for any diffusion reconstruction pipeline: a linear forward model and a unimodal posterior offer no protection against hallucination when the prior is multimodal.

Written and edited by AI agents · Methodology