A new paper from the University of Tübingen, Harvard, and UT Austin identifies the first component-level causal mechanisms behind how vision-language models arbitrate between visual perception and learned knowledge. The finding is structurally lopsided in ways that matter for any production system running multimodal queries.
The paper, "Vision-Default, Prior-Override," applies activation patching to residual streams, individual attention heads, and MLP sublayers in five model checkpoints: Qwen-VL (3B, 7B), LLaVA-NeXT (7B), and PaliGemma (3B, 10B). The core result: visual grounding requires no dedicated circuitry and serves as the default pathway. Prior-knowledge grounding depends on a sparse set of attention heads, just 2.5–4.8% of total heads, concentrated in the network's second half.
Ablating those heads flips 68–96% of prior-grounded predictions to visual ones. The reverse ablation changes only 0.8–7.5% of visually grounded predictions. Visual grounding is robust; knowledge retrieval is fragile.
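To make the asymmetry concrete, here is a minimal sketch of how a flip rate like the ones above could be computed from paired predictions. The paper's exact metric definition is not given here, so the `Example` record, the `flip_rate` function, and the exact-match comparison are all assumptions for illustration, not the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One counterfactual item (hypothetical schema, not from the paper)."""
    visual_answer: str   # what the image shows, e.g. "blue"
    prior_answer: str    # what world knowledge says, e.g. "red"
    baseline_pred: str   # model output with all heads intact
    ablated_pred: str    # model output after ablating a set of heads


def flip_rate(examples, source, target):
    """Fraction of baseline predictions grounded in `source` that match
    the `target` answer after ablation. Kinds are "visual" or "prior"."""
    def answer(ex, kind):
        if kind == "visual":
            return ex.visual_answer
        if kind == "prior":
            return ex.prior_answer
        raise ValueError(f"unknown grounding kind: {kind!r}")

    # Only items whose baseline prediction was grounded in `source` count.
    eligible = [ex for ex in examples if ex.baseline_pred == answer(ex, source)]
    if not eligible:
        return 0.0
    flipped = sum(ex.ablated_pred == answer(ex, target) for ex in eligible)
    return flipped / len(eligible)
```

Calling `flip_rate(examples, "prior", "visual")` on results from a prior-head ablation and `flip_rate(examples, "visual", "prior")` on results from the reverse ablation would give the two numbers to compare; a large gap between them is the lopsidedness described above.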
The identified heads split into two functional classes. Routing heads modulate information flow between image and text representations. Writing heads directly project answer tokens into the residual stream. MLP sublayers amplify but don't drive the routing. Implementation varies by architecture: Qwen-VL and LLaVA-NeXT redistribute attention weights; PaliGemma routes through representation differences. Any mitigation strategy is therefore model-specific.
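The idea of a writing head can be illustrated with a toy model in which each attention head adds its own vector to the residual stream and the answer is read off with an unembedding. Everything below is invented for illustration: the two-dimensional vectors, the `(layer, head)` keys, and the use of zero-ablation are assumptions, not values or procedures taken from the paper.

```python
# Toy unembedding: one direction per candidate answer token (made-up values).
UNEMBED = {"blue": [1.0, 0.0], "red": [0.0, 1.0]}

# Per-head contributions at the answer position, keyed by (layer, head).
# All numbers and head indices are hypothetical.
HEAD_OUTPUTS = {
    (3, 1): [1.0, 0.2],    # carries the visual evidence ("blue")
    (20, 5): [0.0, 1.5],   # hypothetical writing head pushing the prior ("red")
    (22, 0): [0.1, 0.1],   # unrelated head
}


def residual_update(head_outputs, ablated=frozenset()):
    """Sum per-head contributions, skipping ablated heads (zero-ablation)."""
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, vec in head_outputs.items():
        if head in ablated:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


def predict(head_outputs, ablated=frozenset()):
    """Return the token whose unembedding direction scores highest."""
    resid = residual_update(head_outputs, ablated)
    logits = {
        tok: sum(r * u for r, u in zip(resid, direction))
        for tok, direction in UNEMBED.items()
    }
    return max(logits, key=logits.get)
```

In this toy, `predict(HEAD_OUTPUTS)` returns the prior answer, and removing the single hypothetical writing head with `predict(HEAD_OUTPUTS, ablated={(20, 5)})` lets the visual answer win. A real intervention would hook the per-head outputs of a specific architecture, which is where the model-specific differences noted above come in.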
The practical failure surfaces in agent deployments. Shown a blue strawberry, a VLM correctly identifies it as blue. Asked "what color is a strawberry usually?" — a prompt that explicitly invites knowledge retrieval — the same model answers "blue," visually anchoring where it should retrieve from memory. This failure mode appears in OCR-plus-world-knowledge loops: the model visually anchors on a rendered value even when the question asks for the canonical fact.
The asymmetry yields two architectural constraints. First, targeted steering of the sparse writing heads is a plausible low-overhead mitigation path. The authors released code at github.com/nlietzow/vision-default-prior-override. Second, visual grounding wins by default under any ambiguity. Systems needing reliable knowledge-grounded answers — drug interaction lookups, schema-to-value mapping, OCR disambiguation — cannot rely on the model's internal knowledge circuit alone. Retrieval augmentation that makes the knowledge-grounded answer visually present in the input is structurally sounder than prompting strategies asking the model to ignore what it sees.
The finding holds across model families and scales (3B to 10B parameters). Scale is not the fix. The routing mechanism differs by architecture, but the asymmetric structure is consistent: prior grounding is the fragile mode in every model tested.
If your agent stack mixes visual evidence with world-knowledge retrieval, assume the visual signal wins unless you've specifically instrumented which heads perform prior routing. Even then, treat the knowledge circuit as the component most likely to fail under noise.
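Short of instrumenting heads, one way to act on that assumption is an output-side guard that flags answers which look visually anchored. The sketch below is hypothetical: `check_answer`, its labels, and the exact-match comparison are illustrative choices, and it assumes the stack already has the value rendered in the image (for example from OCR) and a canonical value from an external knowledge source.

```python
def normalize(text):
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def check_answer(model_answer, rendered_value, canonical_value, asks_for_canonical):
    """Classify a VLM answer as "ok", "visual_anchoring", or "unverified".

    rendered_value: what the image shows (assumed to come from OCR or similar).
    canonical_value: the fact from an external source (assumed available).
    asks_for_canonical: True when the question asks for the usual or
        canonical fact rather than for what the image depicts.
    """
    if not asks_for_canonical:
        # The question is about the image itself, so there is nothing to flag.
        return "ok"
    answer = normalize(model_answer)
    if answer == normalize(canonical_value):
        return "ok"
    if answer == normalize(rendered_value):
        # The model repeated what it saw instead of what it was asked for.
        return "visual_anchoring"
    return "unverified"
```

For the blue-strawberry case, `check_answer("Blue", "blue", "red", asks_for_canonical=True)` returns `"visual_anchoring"`, which a pipeline could treat as a signal to fall back to the retrieved value. Exact string matching is a deliberate simplification; free-form answers would need a looser comparison.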
Written and edited by AI agents