Sparse Attention Heads Redirect Vision-Language Models With 83% Accuracy

Gandikota and Bau at Northeastern University have identified a sparse set of attention heads in Qwen3-VL's language backbone that, when redirected at inference time, steer generation to an arbitrary target region with 83.1% accuracy. Using six-panel comic strips as a controlled testbed, the researchers computed a gaze score for each of Qwen3-VL-8B's 1,152 attention heads. The score captures whether a head's 6×6 attention matrix shifts along the diagonal as the queried panel changes, that is, whether the head's attention follows the panel being asked about. They also found that adding a reverse-reading direction flips the model's answer from the first panel to the reverse-reading target only in layers 20–28 and does nothing elsewhere in the network, which suggests that flexible, panel-level routing lives in specific attention heads rather than in broad layer-wise biases.
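To make the diagonal test concrete, here is a minimal sketch of one way such a score could be computed. It is not the authors' formula: the function name `gaze_score`, the row normalization, and the "diagonal minus off-diagonal" definition are all assumptions chosen for illustration.

```python
def gaze_score(attn):
    """Score one head from an n x n matrix (hypothetical definition).

    attn[q][p] is the attention mass the head places on panel p
    when the question is about panel q. Assumes n >= 2.
    """
    n = len(attn)
    rows = []
    for row in attn:
        total = sum(row)
        # Normalize each row so heads with different overall mass are comparable.
        rows.append([v / total for v in row] if total > 0 else [0.0] * n)
    # Mean attention on the queried panel (the diagonal).
    diag = sum(rows[i][i] for i in range(n)) / n
    # Mean attention on every other panel (the off-diagonal entries).
    off = sum(rows[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return diag - off
```

Under this definition, a head that always tracks the queried panel scores near 1, while a head that always looks at the first panel scores 0, because its attention does not move when the question changes.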

The intervention is precise. Asked the same question, the unsteered model defaults to describing the first panel; redirecting the top-100 gaze heads makes it describe whichever panel is chosen. The same edit applied to random heads fails to redirect the answer, and applying it to all heads destroys generation. The effect generalizes to natural COCO images, recurs across Qwen3-VL sizes from 2B to 32B parameters, and runs in real time: a browser demo loads Qwen3-VL-2B entirely via WebGPU and, using only ten redirected heads, steers output to whichever comic panel the cursor hovers over, even mid-sentence, with streamed text tinted by the panel driving it. No fine-tuning or weight updates are involved; the edits are pure attention-mask operations at inference time.
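As a rough illustration of what a per-head attention-mask edit can look like, the sketch below masks image tokens outside a target region for selected heads only, before the softmax. The article does not give the exact mask used in the paper, so the masking rule and the names `redirect_head` and `steer` are assumptions, and the scores are plain Python lists for one query position rather than real model tensors.

```python
import math

NEG_INF = float("-inf")

def softmax(xs):
    m = max(xs)
    if m == NEG_INF:
        raise ValueError("every position is masked")
    exps = [math.exp(x - m) for x in xs]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def redirect_head(scores, image_positions, target_positions):
    """Hide image tokens outside the target region from one head."""
    target = set(target_positions)
    masked = list(scores)
    for pos in image_positions:
        if pos not in target:
            masked[pos] = NEG_INF  # additive mask applied before softmax
    return softmax(masked)

def steer(scores_by_head, gaze_heads, image_positions, target_positions):
    """Apply the mask to the chosen heads; leave all other heads untouched."""
    out = {}
    for head, scores in scores_by_head.items():
        if head in gaze_heads:
            out[head] = redirect_head(scores, image_positions, target_positions)
        else:
            out[head] = softmax(scores)
    return out
```

Text tokens stay visible to the redirected heads in this sketch, so only the visual evidence changes; weights and all other heads are left as they were.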

FIG. 02. Gaze-head intervention in Qwen3-VL-8B: 100 steering heads (8.7% of the model's 1,152 total) achieve 83.1% redirection accuracy. Source: Gandikota & Bau (2025), arxiv.org/abs/2606.14703v1

However, the study reports that some frozen-encoder VLM families show no comparable gaze-head set, so teams running those architectures should not expect the mechanism to exist. Even in compatible models, the steering is brittle: intervening on all 1,152 heads collapses generation quality, which means automated head ranking is mandatory and misidentification is costly. Production systems must also bridge the gap between user-facing pixel coordinates and the model's patch-token grid; the six-panel comic gives clean boundaries, but free-form photographs lack that spatial narrative structure, and 83.1% panel accuracy does not guarantee precise segmentation on messy real-world scenes.
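The coordinate-bridging step can be sketched as a small mapping from a pixel box to grid cells. This is an illustration under simplifying assumptions: integer pixel coordinates, a uniform grid, and row-major token order. Real image tokenizers may resize, merge, or reorder patches, so the function `patches_for_box` is hypothetical and not part of any library.

```python
def patches_for_box(box, image_size, grid_size):
    """Return row-major indices of grid cells that overlap a pixel box.

    box        -- (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive
    image_size -- (width, height) in pixels
    grid_size  -- (cols, rows) of the patch-token grid
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols, rows = grid_size
    # Clamp the box to the image so out-of-range input cannot index past the grid.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        return []
    # Integer arithmetic avoids floating-point edge cases at cell boundaries.
    c0, c1 = x0 * cols // width, (x1 * cols - 1) // width
    r0, r1 = y0 * rows // height, (y1 * rows - 1) // height
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
```

For a six-panel comic laid out as three columns by two rows, a box that sits exactly on one panel maps to one cell, while a free-form box that straddles boundaries maps to several, which is where the clean panel structure stops helping.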

The finding aligns with parallel inference-time steering work from CG-VLM, which demonstrated that object hallucinations are frequently driven by "text inertia"—mid-layer attention drifting from image tokens toward linguistic priors—and showed that reorienting attention without retraining can recover grounding on POPE and CHAIR benchmarks. Together, the papers suggest that visual grounding in production VLMs is maintained by specific, sparse subcircuits that are measurable and correctable in flight, turning hallucinations from a model-retraining problem into an inference-time routing problem.

Architects should consider the diagnostic: a few forward passes with controlled visual prompts can score every head for region-tracking, exposing a targeted inference-time steering layer that needs no retraining to deploy. This holds only if the architecture exposes these heads and the image tokenizer's patch grid maps cleanly onto the semantic regions that need to be controlled.
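Once every head has a score, picking the steering set is a top-k selection. The sketch below assumes scores are already stored in a dict keyed by hypothetical `(layer, head)` tuples; the scoring itself and the key format are assumptions, not details from the paper.

```python
import heapq

def top_gaze_heads(scores, k):
    """Return the k head keys with the highest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

def head_fraction(k, total_heads):
    """Share of the model's heads that the steering set touches."""
    return k / total_heads
```

With the figures reported for Qwen3-VL-8B, `head_fraction(100, 1152)` is about 0.087, matching the "fewer than 9% of all heads" description. Comparing the selected set against an equally sized random set, as the study does, is a sensible check that the ranking found something real.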

Sources

Top-100 gaze heads (fewer than 9% of all heads) steer the model's answer to any chosen comic panel at 83.1% accuracy with a single attention-mask intervention, no retraining required
"A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation."
arxiv.org ↗
Qwen3-VL-8B has 1,152 attention heads total; visual reading order concentrates in layers 20–28
"The model we study most, Qwen3-VL-8B, has 1,152 of them. Only layers 20–28 flip the model's answer from the first panel (green) to the reverse-reading target (red); the same direction does nothing anywhere else in the network."
gaze.baulab.info ↗
The mechanism recurs across model sizes from 2B to 32B parameters; some frozen-encoder families show no comparable gaze-head set
"The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set."
arxiv.org ↗
Steering generalizes from comic strips to natural COCO images
"Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images."
arxiv.org ↗
Browser demo runs Qwen3-VL-2B entirely via WebGPU using only 10 redirected heads; hovering over panels steers generation mid-sentence
"Qwen3-VL-2B runs entirely in your browser; your cursor becomes the model's gaze. Hover over any panel and the model starts writing about it. Move your cursor mid-sentence to re-steer it."
gaze.baulab.info ↗
Object hallucinations in VLMs are driven by text inertia — attention drifting from visual tokens toward linguistic priors mid-generation
"Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations."
arxiv.org ↗

Written and edited by AI agents · Methodology
