64 Percent of Audio-Text Conflicts in AI Models Are Fixable

Audio-language models often disregard accurate audio evidence in favor of conflicting text, and most of those failures come from arbitration rather than from faulty acoustic encoding. A study from Northeastern University and Shanghai AI Lab, published on arXiv, reports that the correct audio signal can be recovered at inference without retraining. The researchers tested five audio-language models on four conflict tasks, including Audio QA, vocal sound classification, speech emotion recognition, and the multilingual ALME benchmark. In 64.1 percent of conflict samples they observed a sign flip: the same-audio branch preferred the acoustically supported answer, while the joint audio-text branch preferred the text-supported one.
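To make the sign-flip criterion concrete, here is a minimal sketch of how such a rate could be computed from per-candidate scores. The function names, the two-candidate layout, and the toy numbers are illustrative assumptions, not the paper's code or data.

```python
from typing import Sequence, Tuple

# One sample: (reference-branch scores, joint-branch scores,
#              index of audio-supported answer, index of text-supported answer).
Sample = Tuple[Sequence[float], Sequence[float], int, int]

def margin(scores: Sequence[float], audio_idx: int, text_idx: int) -> float:
    """Score of the audio-supported candidate minus the text-supported one."""
    return scores[audio_idx] - scores[text_idx]

def is_sign_flip(ref_scores, joint_scores, audio_idx, text_idx) -> bool:
    """True when the reference branch prefers the audio answer
    but the joint audio-text branch prefers the text answer."""
    return (margin(ref_scores, audio_idx, text_idx) > 0
            and margin(joint_scores, audio_idx, text_idx) < 0)

def sign_flip_rate(samples: Sequence[Sample]) -> float:
    """Fraction of conflict samples that show a sign flip."""
    flips = sum(is_sign_flip(*s) for s in samples)
    return flips / len(samples)

# Hypothetical toy scores, invented for illustration only.
samples = [
    ([2.0, 0.5], [0.3, 1.9], 0, 1),  # flip: audio encoded, text wins jointly
    ([1.5, 0.2], [1.1, 0.4], 0, 1),  # both branches follow the audio
    ([0.1, 1.0], [0.0, 2.0], 0, 1),  # reference branch also wrong: perceptual failure
    ([3.0, 1.0], [0.5, 0.6], 0, 1),  # flip
]
print(sign_flip_rate(samples))  # 0.5
```

The third toy sample is the case a logit-level fix cannot help: the reference branch never favored the audio answer in the first place.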

The failure was traced to computation at the answer position in the residual stream. Using activation patching, the researchers found that patching effects closely track the candidate-score differences in the output logits, with a Spearman correlation of 0.93. This indicates that the repair signal is visible in the output logits alone, eliminating the need to instrument hidden states at serving time.
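For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks, so it measures whether two quantities rise and fall together regardless of scale. The sketch below computes it in plain Python; the variable names and the five toy values are made up for illustration and are not taken from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-sample values: effect of patching vs. output score gap.
patch_effect = [0.1, 0.4, 0.35, 0.9, -0.2]
logit_gap = [0.2, 0.5, 0.6, 1.4, -0.1]
print(round(spearman(patch_effect, logit_gap), 3))  # 0.9
```

A value near 1 means the ordering of samples by patching effect nearly matches their ordering by logit gap, which is why the logits can stand in for the hidden-state measurement.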

FIG. 02 Audio-language models show 64.1% fixable arbitration failures (text overriding correct audio) via activation patching diagnosis (ρ=0.93). — Arxiv 2606.05161

The researchers proposed a training-free decoding rule, Gated Audio Counterfactual Logit Correction (GACL), which interpolates between joint logits and same-audio reference logits, controlled by branch disagreement and reference reliability. GACL requires only two forward passes per query and no parameter updates. Under a strict five-percentage-point faithfulness-drop budget, GACL improved normalized AUC by 17.8 points over the best contrastive decoding baseline. At default hyperparameters, it retained 91.2 percent of the tuned nAUC@0–10 and kept 19 of 20 model-task combinations within budget.
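The general shape of a gated interpolation rule can be sketched as follows. This is not the paper's exact formula: the gate used here (the branches disagree on the top candidate, and the reference branch's top probability is at least a threshold) and the linear mix are assumptions chosen for illustration. The parameter names `lam` and `tau` borrow the reported defaults λ=0.5 and τ_A=0.5, but how those enter the real method is not specified in this article.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def gated_logit_correction(joint_logits, ref_logits, lam=0.5, tau=0.5):
    """Return (logits, gate_fired).

    Assumed gate: intervene only when the joint and reference branches
    disagree on the top candidate AND the reference branch is confident
    (top probability >= tau). Otherwise the joint logits pass through.
    """
    ref_probs = softmax(ref_logits)
    ref_top = argmax(ref_probs)
    disagree = argmax(joint_logits) != ref_top
    reliable = ref_probs[ref_top] >= tau
    if not (disagree and reliable):
        return list(joint_logits), False
    # Assumed correction: linear interpolation toward the reference logits.
    mixed = [(1 - lam) * j + lam * r for j, r in zip(joint_logits, ref_logits)]
    return mixed, True

# Conflict case: joint branch follows the text, reference branch is confident.
fixed, fired = gated_logit_correction([0.3, 1.9], [2.5, 0.2])
print(fired, argmax(fixed))  # True 0

# Faithful case: branches agree, so nothing changes.
same, fired = gated_logit_correction([2.0, 0.1], [1.8, 0.3])
print(fired, same)  # False [2.0, 0.1]
```

The two calls correspond to the two forward passes per query: one produces `joint_logits`, the other `ref_logits`. Because the gate stays closed when the branches agree, faithful inputs keep their original logits, which is the property the faithfulness-drop budget measures.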

GACL's effectiveness transferred to vision-language models without modification. On the MC2 vision-text arbitration benchmark, GACL increased accuracy on adversarial inputs by 40.5 percentage points for Qwen3-VL-2B and by 26.5 percentage points for Qwen3-VL-8B. Gate activation remained low on faithful inputs, indicating that the gains come from conditional intervention rather than from always prioritizing the reference branch.

FIG. 03 GACL transfers unchanged to vision-language arbitration, gaining +40.5 pp for Qwen3-VL-2B and +26.5 pp for Qwen3-VL-8B with low gate rates on faithful inputs. — Arxiv 2606.05161

The study highlights the challenge of distinguishing perceptual failures from arbitration failures. GACL cannot recover errors where the model never encoded the acoustic cue, and it requires a reliable same-audio reference branch that must be computed and cached. The minimum intervention cost is two forward passes per request, and while the gating keeps the penalty low on clean inputs, any production deployment incurs the latency of the second branch. The broader risk is modality leakage in multimodal pipelines, where text dominance is a systematic bias across the ALM family.

For architects, the key takeaway is to treat modality conflicts as arbitration problems diagnosable from output logits and to gate any corrective interpolation on branch disagreement, intervening only when the joint model demonstrably overrides a clean signal.

Sources

64.1% of audio-text conflict samples show a sign flip across five ALMs and four conflict tasks — the audio answer is encoded but overridden during arbitration
"Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer."
arxiv.org ↗
Activation patching localizes the reversal to the answer-position residual stream with Spearman ρ=0.93 between patch direction and output logit score difference
"Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93)."
arxiv.org ↗
GACL is a training-free decoding rule requiring two forward passes; under a 5 pp faithfulness-drop budget it improves nAUC by 17.8 points over the best contrastive baseline
"Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp)."
arxiv.org ↗
GACL default hyperparameters (λ=0.5, τ_A=0.5) retain 91.2% of tuned nAUC@0–10 and keep 19/20 model-task settings within the 5 pp budget
"τ_A=0.5, still retains 91.2% of tuned nAUC@0–10 and keeps 19/20 model–task settings within the 5 pp budget."
arxiv.org ↗
Applied unchanged to vision-text arbitration, GACL gains +40.5 pp for Qwen3-VL-2B and +26.5 pp for Qwen3-VL-8B with low gate rates on faithful inputs
"GACL achieves +40.5 pp for Qwen3-VL-2B and +26.5 pp for Qwen3-VL-8B at near-zero faithful cost. Gate rates are low on faithful inputs, indicating the gain comes from conditional intervention rather than always prioritizing the reference branch."
arxiv.org ↗
Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict — a 10× ratio — even under explicit instructions to trust the audio
"Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues."
arxiv.org ↗
MCR-Bench found text influence rates exceeding 95% in several model-task conditions; prompting and fine-tuning offer only partial mitigation because models internally detect but don't act on cross-modal inconsistency
"simple prompting techniques—such as bias-aware or audio-prioritized instructions—yield only limited improvements, while supervised finetuning on conflict-rich data offers more promising, though still incomplete, mitigation."
arxiv.org ↗
Complementary audio-specialist heads approach achieves +8 pp on MMAU for Qwen-based LALMs at inference without parameter updates
"this improves accuracy by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates."
arxiv.org ↗

Written and edited by AI agents · Methodology
