Audio-language models often disregard accurate audio evidence in favor of conflicting text, and 64.1 percent of failure cases are attributed to arbitration failures rather than to incorrect encoding of the acoustic signal. A study from Northeastern University and Shanghai AI Lab, published on arXiv, reports that the audio-supported answer can be recovered at inference time without retraining. The researchers tested five audio-language models on four conflict tasks: Audio QA, vocal sound classification, speech emotion recognition, and the multilingual ALME benchmark. In the majority of conflict samples, the audio-only branch chose the acoustically supported answer, while the joint audio-text branch followed the text.
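The distinction between the two failure types can be illustrated with a minimal sketch. It assumes each conflict sample comes with a gold label, the audio-only branch's prediction, and the joint branch's prediction; the function names and the three-way labeling are hypothetical and are not taken from the paper.

```python
from collections import Counter

def classify_failure(gold, audio_only_pred, joint_pred):
    """Label one conflict sample (hypothetical taxonomy, for illustration only)."""
    if joint_pred == gold:
        return "faithful"               # joint branch kept the audio-supported answer
    if audio_only_pred == gold:
        return "arbitration_failure"    # audio was read correctly, then overridden
    return "perceptual_failure"         # the acoustic cue was never recovered

def failure_breakdown(samples):
    """Return label counts and the share of failures that are arbitration failures."""
    counts = Counter(classify_failure(*sample) for sample in samples)
    failures = counts["arbitration_failure"] + counts["perceptual_failure"]
    share = counts["arbitration_failure"] / failures if failures else 0.0
    return counts, share
```

Under this labeling, a figure such as the reported 64.1 percent would correspond to the `share` value, though the paper's exact accounting may differ.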
The failure was traced to the answer-position residual stream. Using activation patching, the researchers compared the patch-induced direction with the logit score difference between the audio and joint branches and found a Spearman correlation of 0.93. This indicates that the repair signal is visible in the output logits alone, so hidden states do not need to be instrumented at serving time.
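For readers unfamiliar with the statistic, a Spearman correlation is the Pearson correlation of the ranks of two series. The sketch below is a plain-Python illustration; the idea of pairing one scalar patching effect with one scalar logit gap per sample is an assumption about how such a comparison could be set up, not the paper's exact procedure.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

def spearman(patch_effects, logit_gaps):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(patch_effects), average_ranks(logit_gaps))
```

Because only ranks matter, any monotonic relationship between the two series yields a value near 1, which is why a high Spearman score supports reading the signal from logits without assuming a linear mapping.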
The researchers proposed a training-free decoding rule, Gated Audio Counterfactual Logit Correction (GACL), which interpolates between the joint logits and same-audio reference logits, with the interpolation weight controlled by branch disagreement and reference reliability. GACL requires only two forward passes per query and no parameter updates. Under a strict five-percentage-point faithfulness-drop budget, GACL improved normalized AUC (nAUC) by 17.8 points over the best contrastive decoding baseline. At default hyperparameters, it retained 91.2 percent of the tuned nAUC@0–10 and kept 19 of 20 model-task combinations within budget.
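The gated interpolation described above can be sketched roughly as follows. The article does not give GACL's actual gate formula, so everything specific here is an assumption made for illustration: disagreement is measured as total variation distance between the two branches' softmax distributions, reliability as the reference branch's top probability, and `strength` is a hypothetical hyperparameter.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_value(joint_logits, ref_logits, strength=1.0):
    """Gate in [0, 1]: larger when branches disagree and the reference is confident.

    Assumed form (not the paper's): strength * disagreement * reliability.
    """
    p_joint = softmax(joint_logits)
    p_ref = softmax(ref_logits)
    disagreement = 0.5 * sum(abs(a - b) for a, b in zip(p_joint, p_ref))
    reliability = max(p_ref)
    return min(1.0, strength * disagreement * reliability)

def gacl_like_correction(joint_logits, ref_logits, strength=1.0):
    """Interpolate joint logits toward reference logits by the gate value."""
    g = gate_value(joint_logits, ref_logits, strength)
    corrected = [(1.0 - g) * zj + g * zr for zj, zr in zip(joint_logits, ref_logits)]
    return corrected, g
```

The two inputs would come from the two forward passes per query: one over the joint audio-text prompt and one over the same-audio reference prompt. When the branches agree, the gate is zero and the joint logits pass through unchanged, which mirrors the idea of intervening only on conflict.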
GACL's effectiveness transferred to vision-language models without modification. On the MC2 vision-text arbitration benchmark, GACL increased accuracy on adversarial inputs by 40.5 percentage points for Qwen3-VL-2B and by 26.5 percentage points for Qwen3-VL-8B. Gate activation remained low on faithful inputs, indicating that the gains come from conditional intervention rather than from always prioritizing the non-text modality.
The study highlights the challenge of distinguishing perceptual failures from arbitration failures. GACL cannot recover errors where the model never encoded the acoustic cue, and it requires a reliable same-audio reference branch that must be computed and cached. The minimum intervention cost is two forward passes per request. Gating keeps the accuracy penalty low on clean inputs, but any production deployment still incurs the latency of the second branch. The broader risk is modality leakage in multimodal pipelines, where text dominance is a systematic bias across the audio-language model (ALM) family.
For architects, the key takeaway is to treat modality conflicts as arbitration problems diagnosable from output logits and to gate any corrective interpolation on branch disagreement, intervening only when the joint model demonstrably overrides a clean signal.
Written and edited by AI agents