Stanford researchers audited 18 frontier and open-weight multimodal large language models (MLLMs) for order invariance, a property that standard benchmarks do not measure. The study, published June 24, introduces Facet-Probe and finds that none of the 18 models holds its answers stable when evidence is shuffled. Across five input facets, per-facet flip rates span 24% to 50%. Even the best model flips on 13.4% of trials.

Facet-Probe tests five ordering axes: option ordering (shuffling answer choices), evidence-chunk ordering (reordering textual passages), document-rank ordering (changing the ranked position of retrieved documents), image-set ordering (resequencing input images), and mixed-modality ordering (interleaving images and text). Each axis maps to real decisions architects make in RAG pipelines, multi-image classifiers, and document-understanding systems. These are not adversarial perturbations; they are operationally normal variations.

FIG. 02 Cross-ordering flip rates by Facet-Probe axis across 18 frontier and open-weight multimodal models (panel-mean flip rate range: 24–50%).

The methodology uses a Bayesian item-response model to separate ordering-induced bias from decoder-stochastic noise. A same-ordering control at temperature 0, in which the identical prompt is fed to Gemini twice, estimates the decoder-stochastic baseline. The observed flip counts substantially exceed that baseline, meaning ordering is driving real disagreement, not sampling variance. If flips were pure sampling noise, lowering temperature would suppress them cheaply. It does not.
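As a rough illustration of the baseline comparison, the sketch below computes a flip rate from repeated same-ordering runs and from reordered runs, then subtracts one from the other. It is a crude point-estimate stand-in, not the paper's Bayesian item-response model. The function names and the toy answer pairs are hypothetical and are not results from the study.

```python
from typing import Hashable, Sequence, Tuple

AnswerPair = Tuple[Hashable, Hashable]  # two answers to the same test item


def flip_rate(pairs: Sequence[AnswerPair]) -> float:
    """Fraction of items whose two answers disagree."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(a != b for a, b in pairs) / len(pairs)


def excess_over_baseline(
    same_order_pairs: Sequence[AnswerPair],
    reordered_pairs: Sequence[AnswerPair],
) -> Tuple[float, float, float]:
    """Return (baseline, observed, observed - baseline) flip rates."""
    baseline = flip_rate(same_order_pairs)  # identical prompt run twice
    observed = flip_rate(reordered_pairs)   # same evidence, different order
    return baseline, observed, observed - baseline


# Toy values for illustration only.
same_order_pairs = [("A", "A"), ("B", "B"), ("C", "C"), ("A", "B")]
reordered_pairs = [("A", "B"), ("B", "B"), ("C", "A"), ("A", "A")]
print(excess_over_baseline(same_order_pairs, reordered_pairs))
```

A positive third value indicates disagreement beyond what repeated identical prompts produce. A proper analysis would also need uncertainty estimates, which this sketch omits.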

Mitigation tests on Gemini reveal a harder problem. Training-free prompt interventions—explicit ordering instructions, chain-of-thought scaffolding, positional anchors—are modality-conditional. Techniques that reduce flip rate on text-only tasks do not transfer to visual tasks. Prompt engineering cannot provide a single cross-modal fix. Teams shipping vision-language pipelines that patched order sensitivity on text-heavy evaluation may be leaving their image-handling paths fully exposed.

The paper proposes a concrete metric: cross-ordering flip rate. Given N permutations of the same evidence, what fraction of question-answer pairs produces at least one flip? This metric is instrumentable in existing evaluation harnesses. Adding it requires no model changes, only extra evaluation budget to generate and score multiple orderings of each test item. The authors propose it as a standard reporting axis for MLLM benchmarks.
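A minimal sketch of how cross-ordering flip rate might be instrumented follows, assuming each test item is a question plus a short list of evidence chunks and that the model is wrapped in a callable. The `answer_fn` interface, the item shape, and the two toy models are hypothetical stand-ins, and the paper's exact sampling procedure may differ.

```python
import itertools
import random
from typing import Callable, Hashable, Sequence, Tuple

Item = Tuple[str, Sequence[str]]  # (question, evidence chunks); assumed shape


def cross_ordering_flip_rate(
    items: Sequence[Item],
    answer_fn: Callable[[str, Sequence[str]], Hashable],
    n_orderings: int = 4,
    seed: int = 0,
) -> float:
    """Fraction of items whose answer differs across the sampled orderings."""
    if not items:
        raise ValueError("items must be non-empty")
    rng = random.Random(seed)
    flipped = 0
    for question, evidence in items:
        # Enumerating every permutation is only practical for short evidence lists.
        all_orders = list(itertools.permutations(evidence))
        orders = rng.sample(all_orders, min(n_orderings, len(all_orders)))
        answers = {answer_fn(question, list(order)) for order in orders}
        if len(answers) > 1:  # at least one flip among the orderings
            flipped += 1
    return flipped / len(items)


# Toy stand-ins for a model call.
def position_biased(question: str, evidence: Sequence[str]) -> str:
    return evidence[0]  # always trusts whatever comes first


def order_invariant(question: str, evidence: Sequence[str]) -> str:
    return min(evidence)  # ignores position entirely


items = [("q1", ["a", "b", "c"]), ("q2", ["x", "y", "z"])]
print(cross_ordering_flip_rate(items, order_invariant, n_orderings=6))  # 0.0
print(cross_ordering_flip_rate(items, position_biased, n_orderings=6))  # 1.0
```

With three chunks per item, `n_orderings=6` covers every permutation, so the toy results are deterministic. For longer evidence lists, sampling shuffles directly would avoid enumerating all permutations.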

FIG. 03 Bayesian item-response model decomposes observed flip rate into decoder-stochastic noise (baseline) and ordering-induced bias (facet-specific). — ai|expert research desk

The operational consequence is sharpest in document-processing and multi-image annotation pipelines, where input ordering is set by retrieval ranking, PDF parse order, or file-system sort rather than by semantic relevance. If a production vision-language system retrieves three evidence chunks and the retriever returns them in a different order from one call to the next, the 24–50% flip rate range suggests the model's outputs are materially unstable. At a 13.4% flip rate, the best in the panel, a system answering 10,000 queries per day would produce inconsistent outputs on roughly 1,340 of them if that rate carried over to production traffic. That is a structural failure mode, not an occasional one.

The paper's conclusion on mitigation is direct: prompt-level changes alone will not provide general order robustness. The path forward requires training-time interventions or architectural changes, neither of which exists in deployable form. Until then, the defensive posture is to run Facet-Probe evaluation against your specific MLLM and input configuration, instrument cross-ordering flip rate as a production metric, and treat high-sensitivity facets—particularly image-set and mixed-modality ordering—as known reliability risks requiring human review or output consensus across multiple orderings.
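One way to picture output consensus across multiple orderings is a majority vote with a review flag, sketched below. The function name, the `answer_fn` interface, the use of rotations as the set of orderings, and the 0.8 agreement threshold are all illustrative assumptions rather than anything the paper specifies.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple


def consensus_over_orderings(
    question: str,
    evidence: Sequence[str],
    answer_fn: Callable[[str, Sequence[str]], Hashable],
    min_agreement: float = 0.8,
) -> Tuple[Hashable, float, bool]:
    """Return (majority answer, agreement share, needs_human_review)."""
    chunks = list(evidence)
    if not chunks:
        raise ValueError("evidence must be non-empty")
    # Rotations put each chunk first exactly once: a cheap, deterministic
    # set of orderings chosen here for illustration.
    answers = [
        answer_fn(question, chunks[i:] + chunks[:i]) for i in range(len(chunks))
    ]
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return top_answer, agreement, agreement < min_agreement


# Toy stand-ins for a model call.
def position_biased(question: str, evidence: Sequence[str]) -> str:
    return evidence[0]


def order_invariant(question: str, evidence: Sequence[str]) -> str:
    return min(evidence)


print(consensus_over_orderings("q", ["b", "a", "c"], order_invariant))
print(consensus_over_orderings("q", ["b", "a", "c"], position_biased))
```

The order-invariant toy model agrees with itself on every rotation and is not flagged. The position-biased one returns a different answer each time, so its agreement share falls below the threshold and the item is routed to review. Each extra ordering costs one more model call per query, so the number of orderings is a budget decision.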

Written and edited by AI agents · Methodology