Researchers from Shanghai AI Laboratory and five collaborating universities have quantified a structural flaw in every autoregressive vision-language model and published a fix that lifts average benchmark accuracy by 4.8 percentage points on Qwen3-VL-8B while adding just 0.32% to parameter count.

The flaw, labeled Visual Signal Dilution, stems from how attention mechanics work in transformer-based LVLMs. Visual tokens are injected once at the start of the context window and never replenished. As the model generates text, the attention partition function expands with each new token, redistributing probability mass across a growing pool. The fixed visual tokens receive progressively smaller attention shares. The paper describes this as asymptotic decay into a Low-Attention Equilibrium. For enterprises running document understanding, image-to-report, or multi-turn visual QA pipelines, accuracy degrades silently as response length grows.
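The arithmetic is easy to reproduce. In the toy NumPy sketch below (ours, not the paper's), attention logits for visual and text tokens are drawn from the same distribution, so any decline in the visual share comes purely from softmax renormalization over a growing context:

```python
import numpy as np

def visual_attention_share(n_visual, n_text, rng):
    # One attention row: logits over all prior tokens (visual + text).
    # Scores come from the same distribution, so any decay is due to
    # softmax renormalization over a longer context, not token content.
    logits = rng.normal(size=n_visual + n_text)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights[:n_visual].sum()  # total mass on the visual block

rng = np.random.default_rng(0)
n_visual = 256  # e.g., image patch tokens injected once at the prompt
for n_text in (64, 256, 1024, 4096):
    share = np.mean([visual_attention_share(n_visual, n_text, rng)
                     for _ in range(100)])
    print(f"text tokens={n_text:>5}  visual attention share≈{share:.3f}")
```

With 256 visual tokens, the expected share falls from roughly 0.8 at 64 generated tokens to under 0.06 at 4,096: the dilution curve the paper formalizes as decay toward a Low-Attention Equilibrium.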

The proposed fix, Persistent Visual Memory (PVM), is a bottleneck adapter inserted as a parallel branch alongside the feed-forward network at three transformer layers: 8, 16, and 24 in the 8B model; 5, 11, and 17 in the 4B. Inside each PVM branch, text hidden states serve as queries for a cross-attention whose keys and values are restricted exclusively to the original, fixed visual embeddings. A zero-initialized learnable gate controls how much of the branch output is mixed back in, allowing the module to start inert and activate only as needed. The PVM latent dimension is 512. Total added parameters on the 8B backbone: 27.92M, or 0.32% overhead.
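That description maps onto a compact module. The PyTorch sketch below is our reading of it, not the released code: single-head attention, a scalar gate, and all names are assumptions, while the 512-dim latent, zero-initialized gate, and visual-only keys/values come from the paper.

```python
import torch
import torch.nn as nn

class PVMBranch(nn.Module):
    """Illustrative Persistent Visual Memory branch (details assumed)."""
    def __init__(self, d_model: int, d_latent: int = 512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # bottleneck in
        self.q = nn.Linear(d_latent, d_latent)     # queries from text states
        self.k = nn.Linear(d_model, d_latent)      # keys from visual embeds only
        self.v = nn.Linear(d_model, d_latent)      # values from visual embeds only
        self.up = nn.Linear(d_latent, d_model)     # bottleneck out
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init: branch starts inert

    def forward(self, h_text, visual_embeds):
        # h_text: (B, T, d_model) text hidden states at the host layer
        # visual_embeds: (B, N_vis, d_model) the original, fixed visual tokens
        q = self.q(self.down(h_text))                              # (B, T, d_latent)
        k, v = self.k(visual_embeds), self.v(visual_embeds)        # (B, N_vis, d_latent)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        out = self.up(attn @ v)                                    # (B, T, d_model)
        return self.gate * out  # summed with the FFN output in the host layer
```

Because the gate starts at zero, the branch contributes nothing at initialization, so the host layer behaves exactly like the unmodified backbone until training decides otherwise.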

Training is two-stage. An SFT pass on 526,000 visually centered samples filtered from OpenMMReasoner-SFT-874K aligns the new module to visual retrieval. A GRPO refinement pass on 3,600 complex reasoning queries from MMK12, ThinkLite-VL-hard, ViRL39K, and We-Math2.0-Pro sharpens the model on tasks requiring sustained visual grounding across long reasoning chains. During SFT, the vision encoder, language backbone, and projector are frozen; only PVM parameters are trained. During GRPO, the language backbone and PVM train jointly. Full-scale runs used eight NVIDIA H200 GPUs at 141 GB VRAM each, with DeepSpeed ZeRO-2 for SFT and ZeRO-3 for GRPO.
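The freezing recipe is simple to express. The sketch below is illustrative only; the parameter-name prefixes ("pvm", "language_model") are our assumptions about module naming, not identifiers from the repository:

```python
def prepare_for_sft(model):
    # Stage 1 (SFT): vision encoder, projector, and language backbone frozen;
    # only PVM parameters receive gradients.
    for name, param in model.named_parameters():
        param.requires_grad = "pvm" in name.lower()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable/1e6:.2f}M / {total/1e9:.2f}B "
          f"({100 * trainable / total:.2f}%)")

def prepare_for_grpo(model):
    # Stage 2 (GRPO): language backbone and PVM train jointly; vision
    # encoder and projector stay frozen.
    for name, param in model.named_parameters():
        lname = name.lower()
        param.requires_grad = "pvm" in lname or "language_model" in lname
```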

Benchmarked across eight evaluations—MMMU, MMBench-CN, MMBench-EN, MMStar, MMT, MathVerse, MathVision, and AI2D—results hold at both scales. Qwen3-VL-8B-Instruct scores 66.7% average accuracy; PVM-8B after SFT reaches 70.6%; PVM-8B after SFT+GRPO reaches 71.5%, a 4.8-point gain. At 4B, baseline is 64.0%; PVM-4B SFT+GRPO reaches 68.4%, a 4.4-point gain. Improvement is highest on complex reasoning tasks that require repeated visual reference while producing long deductive text chains.

FIG. 02 Qwen3-VL accuracy gains from Persistent Visual Memory across 8 benchmarks (8B model, SFT+GRPO). — PVM paper, 2025

The design offers two advantages for enterprise AI architects. First, the PVM branch is structurally independent of the autoregressive stream—it does not inject visual tokens into the text sequence, avoiding the linguistic coherence disruptions that prior visual re-injection schemes introduced. Second, parameter overhead is small enough that retrofitting an existing Qwen3-VL deployment does not materially change inference memory footprint or require re-quantization.

The paper has real limits. All experiments are on Qwen3-VL; generalization to LLaVA, InternVL, or other model families is not demonstrated. No inference latency numbers appear—the parallel cross-attention branch adds FLOPs at every forward pass, and wall-clock overhead on production hardware is unknown. The GitHub repository ships model code and training entry points but no pretrained checkpoints, so teams must run the full two-stage pipeline from scratch.

For any organization running long-context visual workflows on open-source models and attributing accuracy drops to data quality or prompt engineering, PVM is a 28-million-parameter argument that the root cause is in attention mechanics—and it is now patchable.
