Multimodal Models Flip Answers When Evidence Order Changes

Stanford researchers audited 18 frontier and open-weight multimodal large language models (MLLMs) for order invariance, a property that standard benchmarks do not measure because they score each item on a single canonical ordering. The study, published June 24, introduces Facet-Probe and finds that none of the 18 models keeps its answers stable when evidence is shuffled. Across the five input facets, panel-mean per-facet flip rates span 24% to 50%. The best model still flips on 13.4% of trials.

Facet-Probe tests five ordering axes: option ordering (shuffling answer choices), evidence-chunk ordering (reordering textual passages), document-rank ordering (changing the ranked position of retrieved docs), image-set ordering (resequencing input images), and mixed-modality ordering (interleaving images and text). Each axis maps to real decisions architects make in RAG pipelines, multi-image classifiers, and document-understanding systems. These are not adversarial perturbations—they are operationally normal variations.

FIG. 02 Cross-ordering flip rates by Facet-Probe axis across 18 frontier and open-weight multimodal models (panel-mean flip rate range: 24–50%).

The methodology uses a Bayesian item-response model to separate ordering noise from systematic per-facet bias. A same-ordering control on Gemini at temperature 0, which re-runs the same input without reordering it, estimates the decoder-stochastic floor. The observed flips substantially exceed that floor, meaning ordering is driving real disagreement, not sampling variance. If flips were pure sampling noise, lowering the temperature would suppress them cheaply. The temperature-0 control indicates that it does not.
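
As a rough illustration of the control, the sketch below compares the flip rate under shuffled orderings with the flip rate under repeated identical inputs. It is a naive difference of two rates, not the paper's Bayesian item-response model, and the function names and input layout (one list of answers per test item) are assumptions made for this example.

```python
from typing import Sequence


def flip_rate(answer_sets: Sequence[Sequence[str]]) -> float:
    """Fraction of items whose collected answers are not all identical."""
    if not answer_sets:
        return 0.0
    flipped = sum(1 for answers in answer_sets if len(set(answers)) > 1)
    return flipped / len(answer_sets)


def ordering_excess(
    shuffled_answers: Sequence[Sequence[str]],
    repeated_answers: Sequence[Sequence[str]],
) -> float:
    """Naive estimate of ordering-induced flips above the decoder-noise floor.

    shuffled_answers: per item, answers collected under different orderings.
    repeated_answers: per item, answers collected by re-running the same input.
    Hypothetical helper; a plain subtraction, not the paper's estimator.
    """
    return flip_rate(shuffled_answers) - flip_rate(repeated_answers)
```

A value near zero would suggest that the observed flips are mostly decoder noise; a clearly positive value points to ordering itself as the cause.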

Mitigation tests on Gemini reveal a harder problem. Training-free prompt interventions (explicit ordering instructions, chain-of-thought scaffolding, positional anchors) are modality-conditional: techniques that reduce flip rate on text-only tasks do not transfer to visual tasks. In these tests, prompt engineering did not yield a single cross-modal fix. Teams shipping vision-language pipelines that patched order sensitivity on text-heavy evaluations may be leaving their image-handling paths exposed.

The paper proposes a concrete metric: cross-ordering flip rate. Given N permutations of the same evidence, what fraction of question-answer pairs produce at least one flip? The metric can be instrumented in existing evaluation harnesses. It requires generating and scoring multiple orderings of each test item, which multiplies evaluation cost, but it needs no model changes. The authors propose it as a standard reporting axis for MLLM benchmarks.
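
The metric as described here can be sketched in a few lines. This is a minimal illustration that assumes a hypothetical `answer_fn(question, evidence)` callable wrapping the model under test. The paper's exact definition and sampling scheme may differ, and `sample_orderings` enumerates every permutation, so it only suits small evidence sets.

```python
import itertools
import random
from typing import Callable, List, Sequence, Tuple

Ordering = Tuple[str, ...]


def sample_orderings(evidence: Sequence[str], n: int, seed: int = 0) -> List[Ordering]:
    """Return the canonical ordering plus up to n - 1 randomly chosen permutations."""
    rng = random.Random(seed)
    # itertools.permutations yields the input order first, then the rest.
    all_perms = list(itertools.permutations(evidence))
    canonical, rest = all_perms[0], all_perms[1:]
    rng.shuffle(rest)
    return [canonical] + rest[: max(n - 1, 0)]


def cross_ordering_flip_rate(
    items: Sequence[Tuple[str, Sequence[str]]],
    answer_fn: Callable[[str, Ordering], str],
    n_orderings: int = 4,
) -> float:
    """Fraction of items whose answer changes under at least one reordering.

    items: (question, evidence chunks) pairs.
    answer_fn: hypothetical wrapper around the model being evaluated.
    """
    if not items:
        return 0.0
    flipped = 0
    for question, evidence in items:
        answers = {
            answer_fn(question, order)
            for order in sample_orderings(evidence, n_orderings)
        }
        if len(answers) > 1:  # more than one distinct answer means a flip
            flipped += 1
    return flipped / len(items)
```

Each item costs up to `n_orderings` model calls, which is where the extra evaluation budget goes.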

FIG. 03 Bayesian item-response model decomposes observed flip rate into decoder-stochastic noise (baseline) and ordering-induced bias (facet-specific). — ai|expert research desk

The operational consequence is sharpest in document-processing and multi-image annotation pipelines, where input ordering is set by retrieval ranking, PDF parse order, or file-system sort rather than by semantic relevance. If a production vision-language system retrieves three evidence chunks and the retriever returns them in a different order on the next call, the 24–50% panel-mean flip rates suggest the model's answer may change with it. Even the best model's 13.4% flip rate, if it carried over unchanged to a system answering 10,000 reordered queries per day, would mean inconsistent outputs on roughly 1,340 of them (10,000 × 0.134): a structural failure mode, not an occasional one.

The paper's conclusion on mitigation is direct: prompt-level changes alone are unlikely to provide general order robustness. The path forward points to training-time interventions or architectural changes, both of which the paper frames as future work. Until then, the defensive posture is to run a Facet-Probe-style evaluation against your specific MLLM and input configuration, instrument cross-ordering flip rate as a production metric, and treat high-sensitivity facets, particularly image-set and mixed-modality ordering, as known reliability risks that require human review or output consensus across multiple orderings.
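
One way to implement output consensus across orderings is a majority vote with an abstention threshold. The sketch below is an assumption about how a team might do this, not a procedure from the paper. The function name and the 0.75 default are illustrative.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_answer(
    answers: Sequence[str], min_agreement: float = 0.75
) -> Optional[str]:
    """Return the majority answer across orderings, or None to escalate.

    answers: one model answer per ordering of the same evidence.
    min_agreement: illustrative threshold; below it the item is routed
    to human review instead of being answered automatically.
    """
    if not answers:
        return None
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return top_answer
    return None
```

Returning `None` rather than the plurality answer keeps low-agreement items visible, so they can be counted and reviewed instead of silently resolved.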

Sources

None of the 18 MLLMs audited are order-invariant; per-facet panel-mean flip rates span 24–50%
"We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%."
arxiv.org ↗
The best-performing model still flips on 13.4% of trials
"Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials."
arxiv.org ↗
Facet-Probe covers five ordering facets: option, evidence-chunk, document-rank, image-set, and mixed-modality
"We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs."
arxiv.org ↗
A Bayesian item-response model separates ordering noise from per-facet bias; a same-ordering control at temperature 0 estimates the decoder-stochastic floor
"A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips."
arxiv.org ↗
Gemini same-ordering control at temperature 0 shows substantial ordering excess over the decoder-noise floor
"A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-input decoder-noise floor in verified cells."
arxiv.org ↗
Training-free prompt mitigations are modality-conditional and do not transfer from text to visual reasoning
"In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning."
arxiv.org ↗
Prompt-level mitigation alone is unlikely to provide general order robustness; the authors point to training-time and architectural approaches as future work
"These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches."
arxiv.org ↗
The paper proposes cross-ordering flip rate as a standard reporting axis for MLLM benchmarks
"We propose cross-ordering flip rate as a standard reporting axis for MLLMs."
arxiv.org ↗
Standard benchmarks score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer
"Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines."
arxiv.org ↗

Written and edited by AI agents · Methodology
