Researchers from Shanghai AI Laboratory and five collaborating universities have quantified a structural flaw in every autoregressive vision-language model and published a fix that lifts average benchmark accuracy by 4.8 percentage points on Qwen3-VL-8B while adding just 0.32% to parameter count.

The flaw, labeled Visual Signal Dilution, stems from how attention mechanics work in transformer-based LVLMs. Visual tokens are injected once at the start of the context window and never replenished. As the model generates text, the attention partition function expands with each new token, redistributing probability mass across a growing pool. The fixed visual tokens receive progressively smaller attention shares. The paper describes this as asymptotic decay into a Low-Attention Equilibrium. For enterprises running document understanding, image-to-report, or multi-turn visual QA pipelines, accuracy degrades silently as response length grows.
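The arithmetic is easy to reproduce. In the toy NumPy sketch below (ours, not the paper's), attention logits for visual and text tokens are drawn from the same distribution, so any decline in the visual share comes purely from softmax renormalization over a growing context:

```python
import numpy as np

def visual_attention_share(n_visual, n_text, rng):
    # One attention row: logits over all prior tokens (visual + text).
    # Scores come from the same distribution, so any decay is due to
    # softmax renormalization over a longer context, not token content.
    logits = rng.normal(size=n_visual + n_text)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights[:n_visual].sum()  # total mass on the visual block

rng = np.random.default_rng(0)
n_visual = 256  # e.g., image patch tokens injected once at the prompt
for n_text in (64, 256, 1024, 4096):
    share = np.mean([visual_attention_share(n_visual, n_text, rng)
                     for _ in range(100)])
    print(f"text tokens={n_text:>5}  visual attention share≈{share:.3f}")
```

With 256 visual tokens, the expected share falls from roughly 0.8 at 64 generated tokens to under 0.06 at 4,096: the dilution curve the paper formalizes as decay toward a Low-Attention Equilibrium.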

The proposed fix, Persistent Visual Memory (PVM), is a bottleneck adapter inserted as a parallel branch alongside the feed-forward network at three transformer layers: 8, 16, and 24 in the 8B model; 5, 11, and 17 in the 4B. Inside each PVM branch, text hidden states serve as queries for a cross-attention whose keys and values are restricted exclusively to the original, fixed visual embeddings. A zero-initialized learnable gate controls how much of the branch output is mixed back in, allowing the module to start inert and activate only as needed. The PVM latent dimension is 512. Total added parameters on the 8B backbone: 27.92M, or 0.32% overhead.
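That description maps onto a compact module. The PyTorch sketch below is our reading of it, not the released code: single-head attention, a scalar gate, and all names are assumptions, while the 512-dim latent, zero-initialized gate, and visual-only keys/values come from the paper.

```python
import torch
import torch.nn as nn

class PVMBranch(nn.Module):
    """Illustrative Persistent Visual Memory branch (details assumed)."""
    def __init__(self, d_model: int, d_latent: int = 512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # bottleneck in
        self.q = nn.Linear(d_latent, d_latent)     # queries from text states
        self.k = nn.Linear(d_model, d_latent)      # keys from visual embeds only
        self.v = nn.Linear(d_model, d_latent)      # values from visual embeds only
        self.up = nn.Linear(d_latent, d_model)     # bottleneck out
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init: branch starts inert

    def forward(self, h_text, visual_embeds):
        # h_text: (B, T, d_model) text hidden states at the host layer
        # visual_embeds: (B, N_vis, d_model) the original, fixed visual tokens
        q = self.q(self.down(h_text))                              # (B, T, d_latent)
        k, v = self.k(visual_embeds), self.v(visual_embeds)        # (B, N_vis, d_latent)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        out = self.up(attn @ v)                                    # (B, T, d_model)
        return self.gate * out  # summed with the FFN output in the host layer
```

Because the gate starts at zero, the branch contributes nothing at initialization, so the host layer behaves exactly like the unmodified backbone until training decides otherwise.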

Training is two-stage. An SFT pass on 526,000 visually centered samples filtered from OpenMMReasoner-SFT-874K aligns the new module to visual retrieval. A GRPO refinement pass on 3,600 complex reasoning queries from MMK12, ThinkLite-VL-hard, ViRL39K, and We-Math2.0-Pro sharpens the model on tasks requiring sustained visual grounding across long reasoning chains. During SFT, the vision encoder, language backbone, and projector are frozen; only PVM parameters are trained. During GRPO, the language backbone and PVM train jointly. Full-scale runs used eight NVIDIA H200 GPUs at 141 GB VRAM each, with DeepSpeed ZeRO-2 for SFT and ZeRO-3 for GRPO.
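The freezing recipe is simple to express. The sketch below is illustrative only; the parameter-name prefixes ("pvm", "language_model") are our assumptions about module naming, not identifiers from the repository:

```python
def prepare_for_sft(model):
    # Stage 1 (SFT): vision encoder, projector, and language backbone frozen;
    # only PVM parameters receive gradients.
    for name, param in model.named_parameters():
        param.requires_grad = "pvm" in name.lower()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable/1e6:.2f}M / {total/1e9:.2f}B "
          f"({100 * trainable / total:.2f}%)")

def prepare_for_grpo(model):
    # Stage 2 (GRPO): language backbone and PVM train jointly; vision
    # encoder and projector stay frozen.
    for name, param in model.named_parameters():
        lname = name.lower()
        param.requires_grad = "pvm" in lname or "language_model" in lname
```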

Benchmarked across eight evaluations—MMMU, MMBench-CN, MMBench-EN, MMStar, MMT, MathVerse, MathVision, and AI2D—results hold at both scales. Qwen3-VL-8B-Instruct scores 66.7% average accuracy; PVM-8B after SFT reaches 70.6%; PVM-8B after SFT+GRPO reaches 71.5%, a 4.8-point gain. At 4B, baseline is 64.0%; PVM-4B SFT+GRPO reaches 68.4%, a 4.4-point gain. Improvement is highest on complex reasoning tasks that require repeated visual reference while producing long deductive text chains.

FIG. 02 Qwen3-VL accuracy gains from Persistent Visual Memory across 8 benchmarks (8B model, SFT+GRPO). — PVM paper, 2025

The design offers two advantages for enterprise AI architects. First, the PVM branch is structurally independent of the autoregressive stream—it does not inject visual tokens into the text sequence, avoiding the linguistic coherence disruptions that prior visual re-injection schemes introduced. Second, parameter overhead is small enough that retrofitting an existing Qwen3-VL deployment does not materially change inference memory footprint or require re-quantization.

The paper has real limits. All experiments are on Qwen3-VL; generalization to LLaVA, InternVL, or other model families is not demonstrated. No inference latency numbers appear—the parallel cross-attention branch adds FLOPs at every forward pass, and wall-clock overhead on production hardware is unknown. The GitHub repository ships model code and training entry points but no pretrained checkpoints, so teams must run the full two-stage pipeline from scratch.

For any organization running long-context visual workflows on open-source models and attributing accuracy drops to data quality or prompt engineering, PVM is a 28-million-parameter argument that the root cause is in attention mechanics—and it is now patchable.
