Gandikota and Bau at Northeastern University have identified a sparse set of attention heads in Qwen3-VL's language backbone that, when redirected at inference time, can steer generation to an arbitrary target region with 83.1% accuracy. Using six-panel comic strips as a controlled testbed, the researchers computed a gaze score for each of Qwen3-VL-8B's 1,152 attention heads; the score measures whether the head's 6×6 attention matrix (queried panel against attended panel) shifts along the diagonal when the queried panel changes. They also found that only layers 20–28 reliably flip the model's answer when a reverse-reading direction is added, supporting the view that flexible, panel-level routing lives in the attention heads rather than in broad layer-wise biases.
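The gaze score described above can be illustrated with a small sketch. The article does not give the paper's exact formula, so the definition here (mean diagonal attention minus mean off-diagonal attention) is an assumption chosen for illustration, and the names `gaze_score`, `tracking_head`, and `fixed_head` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def gaze_score(attn: Matrix) -> float:
    """Diagonal dominance of one head's square panel-attention matrix.

    Assumption: attn[i][j] is the share of the head's image attention that
    lands on panel j when the question asks about panel i.
    """
    n = len(attn)
    if n < 2 or any(len(row) != n for row in attn):
        raise ValueError("expected a square matrix with at least 2 panels")
    # Average attention on the queried panel (the diagonal).
    on_diag = sum(attn[i][i] for i in range(n)) / n
    # Average attention on every other panel (the off-diagonal cells).
    off_diag = sum(
        attn[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return on_diag - off_diag


# A head that always looks at the queried panel...
tracking_head = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
# ...versus a head that stares at panel 0 regardless of the question.
fixed_head = [[1.0 if j == 0 else 0.0 for j in range(6)] for _ in range(6)]

tracking_score = gaze_score(tracking_head)
fixed_score = gaze_score(fixed_head)
```

Under this toy definition a head that follows the queried panel scores high, while a head that ignores the question scores near zero, which is the contrast the testbed is designed to expose.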

The intervention is precise: redirecting the top-100 gaze heads makes the model describe any chosen panel in response to the same question. Without steering, the model defaults to the first panel. The same edit applied to random heads fails, and applying it to all heads destroys generation. The effect generalizes to natural COCO images, recurs across Qwen3-VL sizes from 2B to 32B parameters, and runs in real time. A browser demo loads Qwen3-VL-2B entirely via WebGPU and, using only ten redirected heads, steers output to whichever comic panel the cursor hovers over, even mid-sentence, with streamed text tinted by the panel driving it. No fine-tuning or weight updates are involved; the edits are pure attention-mask operations at inference time.
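As a rough illustration of what an attention-mask edit on selected heads can look like, the sketch below masks image tokens outside a target region for the chosen heads only and leaves every other head untouched. This is an assumed mechanism for illustration, not the authors' exact edit; `steer_head`, `steer_layer`, and the toy positions are hypothetical.

```python
import math
from typing import List, Sequence, Set


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax that treats -inf as a masked position."""
    finite = [s for s in scores if s != -math.inf]
    if not finite:
        raise ValueError("every position is masked")
    peak = max(finite)
    exps = [0.0 if s == -math.inf else math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def steer_head(scores: Sequence[float], image_positions: Set[int],
               target_positions: Set[int]) -> List[float]:
    """Mask image keys outside the target region, then renormalize."""
    masked = [
        -math.inf if (k in image_positions and k not in target_positions) else s
        for k, s in enumerate(scores)
    ]
    return softmax(masked)


def steer_layer(scores_per_head: Sequence[Sequence[float]], gaze_heads: Set[int],
                image_positions: Set[int],
                target_positions: Set[int]) -> List[List[float]]:
    """Apply the mask only to the selected heads; others keep plain softmax."""
    return [
        steer_head(s, image_positions, target_positions)
        if h in gaze_heads else softmax(s)
        for h, s in enumerate(scores_per_head)
    ]


# Toy example: keys 0-2 are image tokens, key 3 is a text token.
scores = [2.0, 1.0, 1.0, 0.5]
weights = steer_layer(
    [scores, scores],      # two heads with identical raw scores
    gaze_heads={0},        # only head 0 is redirected
    image_positions={0, 1, 2},
    target_positions={2},  # steer toward the image token at key 2
)
```

In the toy example, head 0 ends up with zero weight on image tokens outside the target while head 1 is unchanged, mirroring the article's point that the edit is applied to a small, selected set of heads rather than to all of them.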

FIG. 02 Gaze-head intervention in Qwen3-VL-8B: 100 steering heads (8.7% of model's 1,152 total) achieve 83.1% redirection accuracy. — Gandikota & Bau (2025), arxiv.org/abs/2606.14703v1

However, the study reports that some frozen-encoder VLM families show no comparable gaze-head set, so teams running those architectures should not expect the mechanism to exist. Even in compatible models, the steering is brittle: intervening on all 1,152 heads collapses generation quality, which means automated head ranking is mandatory and misidentification is costly. Production systems must also bridge the gap between user-facing pixel coordinates and the model's patch-token grid; the six-panel comic gives clean boundaries, but free-form photographs lack that spatial narrative structure, and 83.1% panel accuracy does not guarantee precise segmentation on messy real-world scenes.
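The pixel-to-patch bridging step mentioned above might look like the following sketch. It assumes a row-major grid of square cells with one visual token per cell and no resizing between the user's coordinates and the model input; the cell size is a parameter because the real value depends on the model's vision preprocessing. `box_to_token_indices` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def box_to_token_indices(box: Tuple[int, int, int, int],
                         image_size: Tuple[int, int],
                         cell_px: int) -> List[int]:
    """Map a pixel box to row-major indices of the grid cells it touches.

    box is (x0, y0, x1, y1) in pixels with x1 and y1 exclusive.
    image_size is (width, height) in pixels.
    cell_px is the assumed side length, in pixels, covered by one token.
    """
    width, height = image_size
    if cell_px <= 0 or width <= 0 or height <= 0:
        raise ValueError("sizes must be positive")
    # Clamp the box to the image so out-of-range coordinates are tolerated.
    x0, y0 = max(0, box[0]), max(0, box[1])
    x1, y1 = min(width, box[2]), min(height, box[3])
    if x0 >= x1 or y0 >= y1:
        raise ValueError("box is empty after clamping to the image")
    cols = math.ceil(width / cell_px)
    # First and last grid column and row touched by the box (inclusive).
    c0, c1 = x0 // cell_px, (x1 - 1) // cell_px
    r0, r1 = y0 // cell_px, (y1 - 1) // cell_px
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


# A 96x64 image with 32-pixel cells gives a 3-column by 2-row grid.
right_two_top_cells = box_to_token_indices((32, 0, 96, 32), (96, 64), 32)
straddling_box = box_to_token_indices((40, 10, 50, 40), (96, 64), 32)
```

A box that straddles a cell boundary selects every cell it touches, which shows why free-form regions map less cleanly to tokens than comic panels with aligned borders.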

The finding aligns with parallel inference-time steering work from CG-VLM, which demonstrated that object hallucinations are frequently driven by "text inertia" (mid-layer attention drifting from image tokens toward linguistic priors) and showed that reorienting attention without retraining can recover grounding on the POPE and CHAIR benchmarks. Together, the papers suggest that visual grounding in production VLMs is maintained by specific, sparse subcircuits that are measurable and correctable in flight, turning hallucinations from a model-retraining problem into an inference-time routing problem.

Architects should consider running the diagnostic themselves: a few forward passes with controlled visual prompts can score every head for region-tracking, exposing a targeted inference-time steering layer that requires no retraining to deploy. Two conditions apply: your architecture must expose these heads, and you must verify that your image tokenizer's patch grid maps cleanly to the semantic regions you need to control.
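Once every head has a score, selecting the steering set is a plain top-k ranking. The sketch below assumes scores are already stored per (layer, head) pair; the `top_k_heads` helper and the toy scores are hypothetical, and only the 100-of-1,152 figure comes from the article.

```python
import heapq
from typing import Dict, List, Tuple

HeadId = Tuple[int, int]  # (layer index, head index)


def top_k_heads(scores: Dict[HeadId, float], k: int) -> List[HeadId]:
    """Return the k highest-scoring heads, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return heapq.nlargest(k, scores, key=scores.__getitem__)


# Toy scores for four heads; real scores would come from the forward passes.
demo_scores = {(20, 3): 0.91, (24, 7): 0.85, (2, 0): 0.02, (30, 1): 0.10}
selected = top_k_heads(demo_scores, 2)

# Share of heads steered in the reported 8B setup: 100 of 1,152.
steered_share_pct = round(100 / 1152 * 100, 1)
```

Comparing the selected set against an equally sized random set, as the study does, is a cheap sanity check that the ranking found heads that actually track regions.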

Written and edited by AI agents · Methodology