Vision-language models route prior knowledge through just 2.5–4.8% of attention heads

Mechanistic analysis of three VLM families reveals how they resolve conflicts between visual evidence and memorized knowledge: visual grounding is the default pathway, while prior-knowledge answers depend on a small, fragile set of attention heads. Understanding this asymmetry shapes multimodal reliability at scale.

A new paper from the University of Tübingen, Harvard, and UT Austin identifies the first component-level causal mechanisms behind how vision-language models arbitrate between visual perception and learned knowledge. The finding is structurally lopsided in ways that matter for any production system running multimodal queries.

The paper, "Vision-Default, Prior-Override," applies activation patching to residual streams, individual attention heads, and MLP sublayers in five model checkpoints: Qwen-VL (3B, 7B), LLaVA-NeXT (7B), and PaliGemma (3B, 10B). The core result: visual grounding requires no dedicated circuitry and serves as the default pathway. Prior-knowledge grounding depends on a sparse set of attention heads, just 2.5–4.8% of total heads, concentrated in the network's second half.
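Activation patching can be illustrated with a toy model. The sketch below is not the authors' code: it assumes a hypothetical "model" whose output logit is simply the sum of scalar head contributions, and the names `run` and `patching_effects` are invented for illustration. It caches head activations from a clean run, substitutes them one at a time into a corrupted run, and reports how much of the clean-versus-corrupted gap each head restores.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy stand-in for a model (an assumption, not a real VLM): each "head" maps
# a scalar input to a scalar contribution; the logit is the sum of them.
Head = Callable[[float], float]


def run(heads: List[Head], x: float,
        patch: Optional[Dict[int, float]] = None) -> Tuple[float, List[float]]:
    """Run the toy model, optionally overwriting some head activations."""
    patch = patch or {}
    acts = [patch[i] if i in patch else head(x) for i, head in enumerate(heads)]
    return sum(acts), acts


def patching_effects(heads: List[Head], clean_x: float,
                     corrupt_x: float) -> List[float]:
    """Fraction of the clean-vs-corrupt logit gap restored by patching each head."""
    clean_out, clean_acts = run(heads, clean_x)
    corrupt_out, _ = run(heads, corrupt_x)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupt runs give the same output")
    effects = []
    for i in range(len(heads)):
        # Re-run the corrupted input with only head i set to its clean value.
        patched_out, _ = run(heads, corrupt_x, {i: clean_acts[i]})
        effects.append((patched_out - corrupt_out) / gap)
    return effects


# Hypothetical heads: head 1 carries most of the signal, head 0 none.
toy_heads = [lambda x: 0.0 * x, lambda x: 3.0 * x, lambda x: 1.0 * x]
print(patching_effects(toy_heads, clean_x=1.0, corrupt_x=0.0))  # [0.0, 0.75, 0.25]
```

In a real transformer the cached values would be tensors captured with forward hooks and the metric would usually be a logit difference between two candidate answers, but the loop structure is the same.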

Ablating those heads flips 68–96% of prior-grounded predictions to visual ones. The same ablation changes only 0.8–7.5% of visually grounded predictions. Visual grounding is robust; knowledge retrieval is fragile.
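The flip percentages are conditional rates: of the predictions that started in one grounding mode, what share ended in the other after ablation. A minimal sketch of that bookkeeping follows; the label strings and the `flip_rate` helper are hypothetical, and the example data is made up to show the arithmetic rather than taken from the paper.

```python
from typing import List


def flip_rate(before: List[str], after: List[str],
              source: str, target: str) -> float:
    """Share of examples labeled `source` before ablation that are `target` after."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    eligible = [a for b, a in zip(before, after) if b == source]
    if not eligible:
        return 0.0
    return sum(a == target for a in eligible) / len(eligible)


# Made-up labels for five examples, before and after ablating the sparse heads.
before = ["prior", "prior", "prior", "prior", "visual"]
after = ["visual", "visual", "visual", "prior", "visual"]

print(flip_rate(before, after, "prior", "visual"))  # 0.75
print(flip_rate(before, after, "visual", "prior"))  # 0.0
```

Reporting both directions from the same ablation is what exposes the asymmetry: a high prior-to-visual rate alongside a low visual-to-prior rate.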

FIG. 02 Ablation reveals asymmetric routing: prior knowledge gates depend on sparse heads; visual information is routed independently. — Tübingen & Harvard, 2024

The identified heads split into two functional classes. Routing heads modulate information flow between image and text representations. Writing heads directly project answer tokens into the residual stream. MLP sublayers amplify but don't drive the routing. Implementation varies by architecture: Qwen-VL and LLaVA-NeXT redistribute attention between image and text tokens; PaliGemma routes through differences in the attended representations. Any mitigation strategy is therefore model-specific.
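Two simple diagnostics can help tell the classes apart. The sketch below is an assumption about how one might probe them, not the paper's procedure: `image_attention_share` measures how much of a head's attention lands on image tokens (a routing signal), and `direct_logit_attribution` projects a head's output onto an answer token's unembedding direction (a writing signal). All vectors and values are hypothetical.

```python
from typing import List


def image_attention_share(attn_row: List[float], is_image: List[bool]) -> float:
    """Fraction of one query position's attention mass that goes to image tokens."""
    if len(attn_row) != len(is_image):
        raise ValueError("attention row and mask must have the same length")
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(w for w, img in zip(attn_row, is_image) if img) / total


def direct_logit_attribution(head_out: List[float],
                             unembed_dir: List[float]) -> float:
    """Dot product of a head's residual-stream output with an answer token's direction."""
    if len(head_out) != len(unembed_dir):
        raise ValueError("vectors must have the same dimension")
    return sum(h * u for h, u in zip(head_out, unembed_dir))


# Hypothetical attention rows for the same head under two prompt types.
mask = [True, True, False, False]        # first two positions are image tokens
visual_prompt_attn = [0.4, 0.4, 0.1, 0.1]
prior_prompt_attn = [0.1, 0.2, 0.3, 0.4]

shift = (image_attention_share(visual_prompt_attn, mask)
         - image_attention_share(prior_prompt_attn, mask))
print(f"attention moved off image tokens: {shift:.2f}")

# Hypothetical head output and unembedding direction for the answer token.
print(direct_logit_attribution([1.0, 2.0], [0.5, 0.25]))  # 1.0
```

A head whose image-token share shifts strongly between prompt types behaves like a routing head in models that redistribute attention; for PaliGemma, where routing runs through the attended representations, the attention-share probe alone would likely miss it.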

FIG. 03 Two-class decomposition of sparse heads: routing gates information flow; writing projects predictions into the residual stream. — ai|expert interpretation

The practical failure surfaces in agent deployments. Shown a blue strawberry, a VLM correctly identifies it as blue. Asked "what color is a strawberry usually?", a prompt that explicitly invites knowledge retrieval, the same model frequently answers "blue," visually anchoring where it should retrieve from memory. This failure mode appears in OCR-plus-world-knowledge loops: the model visually anchors on a rendered value even when the question asks for the canonical fact.

The asymmetry yields two architectural constraints. First, targeted steering of the sparse writing heads is a plausible low-overhead mitigation path. The authors released code at github.com/nlietzow/vision-default-prior-override. Second, visual grounding wins by default under any ambiguity. Systems needing reliable knowledge-grounded answers — drug interaction lookups, schema-to-value mapping, OCR disambiguation — cannot rely on the model's internal knowledge circuit alone. Retrieval augmentation that makes the knowledge-grounded answer visually present in the input is structurally sounder than prompting strategies asking the model to ignore what it sees.
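Head steering of the kind mentioned above can be pictured as rescaling selected heads' contributions before they are summed into the residual stream. The following is a toy sketch under that assumption, not the released code: `steer_residual`, the head indices, and the scale factor `alpha` are all hypothetical.

```python
from typing import List, Set


def steer_residual(head_outputs: List[List[float]],
                   writing_heads: Set[int], alpha: float) -> List[float]:
    """Sum per-head outputs into a residual vector, scaling chosen heads by alpha."""
    if not head_outputs:
        return []
    dim = len(head_outputs[0])
    residual = [0.0] * dim
    for i, out in enumerate(head_outputs):
        if len(out) != dim:
            raise ValueError("all head outputs must share one dimension")
        scale = alpha if i in writing_heads else 1.0
        for d in range(dim):
            residual[d] += scale * out[d]
    return residual


# Three hypothetical heads writing into a 2-dimensional residual stream.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Amplify head 1 (treated here as a prior-knowledge writing head) by 2x.
print(steer_residual(outputs, writing_heads={1}, alpha=2.0))  # [1.5, 2.5]

# alpha = 0.0 reproduces a zero-ablation of the same head.
print(steer_residual(outputs, writing_heads={1}, alpha=0.0))  # [1.0, 0.5]
```

Because only a few percent of heads are involved, an intervention like this touches a small number of scalars per layer, which is why it is attractive as a low-overhead option. Which heads to scale, and by how much, would have to be determined per model.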

The finding holds across model families and scales (3B to 10B parameters). Scale is not the fix. The routing mechanism differs by architecture, but the asymmetric structure is consistent: prior grounding is the fragile mode in every model tested.

If your agent stack mixes visual evidence with world-knowledge retrieval, assume the visual signal wins unless you've specifically instrumented which heads perform prior routing. Even then, treat the knowledge circuit as the component most likely to fail under noise.

Sources

Prior-knowledge grounding depends on 2.5–4.8% of attention heads concentrated in the second half of the network; ablating them flips 68–96% of prior-grounded predictions to visually grounded answers while changing only 0.8–7.5% of visually grounded predictions
"visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network... Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions"
arxiv.org ↗
Identified heads decompose into routing heads that modulate information flow and writing heads that directly project answer tokens into the residual stream; MLP sublayers play an amplifier role
"The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales"
arxiv.org ↗
Tested across Qwen-VL 3B/7B, LLaVA-NeXT 7B, and PaliGemma 3B/10B; Qwen-VL and LLaVA-NeXT redistribute attention between image and text tokens while PaliGemma routes through differences in attended representations
"the routing implementation diverges across architectures: Qwen-VL and LLaVA-NeXT redistribute attention between image and text tokens, whereas PaliGemma routes through differences in the attended representations"
arxiv.org ↗
VLMs visually anchor even when prompted for prior-knowledge answers — shown a blue strawberry and asked 'what color is a strawberry usually?', the model still answers based on visual input
"when asked 'what color is a strawberry usually?', a question that should rely on prior knowledge rather than the image, the model frequently continues to respond based on the observed visual input"
arxiv.org ↗

Written and edited by AI agents · Methodology
