Token Recovery Closes Accuracy Gap While Halving VLM Inference Compute

Reroute, a training-free plug-in for vision-language models (VLMs), compresses image-token budgets at three tiers (avg_T of 192, 128, and 64) while retaining visual information through a recoverable routing mechanism. Unlike FastV and PyramidDrop, which remove unselected tokens for good, Reroute lets deferred tokens re-enter the candidate pool at later decoder layers.

Reroute is benchmarked on LLaVA-1.5-7B and Qwen2.5-VL-7B-Instruct using the lmms-eval harness, on grounding and general VQA benchmarks. It replaces the physical_delete step of the pruning method it augments. Its compact_route variant keeps unselected tokens in the residual stream: they bypass the current stage's attention blocks but remain eligible for re-selection at subsequent routing decision points. The compact_route_stagewise variant makes the same routing decisions but keeps the sequence compact across the non-routing layers within a stage, which reduces memory bandwidth while maintaining bit-identical accuracy.
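The difference between physical deletion and recoverable routing can be illustrated with a toy simulation. The sketch below is not the Reroute implementation: the function names, scores, and budgets are hypothetical, and the per-stage scores stand in for the attention-score rankings a real model would produce. It only shows how a token that ranks low at one stage can be re-selected later if it stays in the candidate pool.

```python
def top_k_indices(scores, candidates, k):
    """Return the k candidate indices with the highest scores, in sequence order."""
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def route(stage_scores, budgets, recoverable):
    """Simulate token selection over successive routing stages.

    stage_scores: one list of per-token importance scores per stage
                  (hypothetical values, not real attention scores).
    budgets:      number of tokens allowed through attention per stage.
    recoverable:  True  -> deferred tokens stay in the candidate pool
                  False -> unselected tokens are deleted permanently
    Returns the list of active token indices at each stage.
    """
    pool = list(range(len(stage_scores[0])))
    history = []
    for scores, k in zip(stage_scores, budgets):
        active = top_k_indices(scores, pool, k)
        history.append(active)
        if not recoverable:
            pool = active  # physical delete: the rest can never return
        # recoverable: the pool is unchanged, so deferred tokens are
        # scored again at the next routing decision
    return history


# Token 3 scores low at stage 0 but becomes the top token at stage 1.
stage_scores = [
    [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],
    [0.2, 0.6, 0.1, 0.9, 0.3, 0.4],
]
budgets = [3, 2]

pruned = route(stage_scores, budgets, recoverable=False)
rerouted = route(stage_scores, budgets, recoverable=True)
print(pruned)    # [[0, 1, 2], [0, 1]]
print(rerouted)  # [[0, 1, 2], [1, 3]]
```

With deletion, token 3 is gone before its score rises; with recoverable routing it is picked up at the second stage under the same budget.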

Reroute is evaluated across 38 configurations spanning three FLOPs tiers (avg_T of 192, 128, and 64). It reuses existing attention-score ranking rules and stage-wise schedules, so it requires no additional training or custom scoring heads. By recovering tokens instead of permanently dropping them, Reroute closes the accuracy gap on RefCOCO under aggressive budgets while matching general VQA numbers.
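The sources describe the tiers only as an "average token convention". One plausible reading, assumed here and not confirmed by the paper or repository, is the mean number of active visual tokens per decoder layer. The sketch below computes that quantity for a hypothetical stage-wise schedule; the schedule itself is invented for illustration, with 576 visual tokens and 32 layers chosen to resemble LLaVA-1.5-7B.

```python
def average_tokens(schedule, num_layers):
    """Average number of active visual tokens per decoder layer.

    schedule: list of (start_layer, active_tokens) pairs, sorted by
              start_layer; each entry applies until the next one begins.
    """
    total = 0
    for idx, (start, tokens) in enumerate(schedule):
        end = schedule[idx + 1][0] if idx + 1 < len(schedule) else num_layers
        total += tokens * (end - start)
    return total / num_layers


# Hypothetical 4-stage schedule (assumption, not taken from the repository).
schedule = [(0, 576), (8, 128), (16, 48), (24, 16)]
print(average_tokens(schedule, num_layers=32))  # 192.0
```

Under this reading, a recoverable method and the pruning method it augments land in the same tier as long as they follow the same schedule, since only the number of active tokens per layer enters the average.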

Figure 2. Reroute recoverable routing: deferred tokens bypass stages and re-enter the candidate pool for reconsideration at the next routing decision. Source: Reroute, arXiv:2606.12412

No production evidence is available yet. The method maintains the theoretical TFLOPs and KV-cache budget class of the pruning method it augments, but the paper and repository do not report measured wall-clock latency, throughput, or per-request cost. All experiments were conducted on a single GPU with PyTorch 2.11.0 and CUDA 12.8, using transformers 5.4.0 and an editable install of lmms-eval 0.7.1. Architects would need to see integration with a production serving stack such as vLLM or SGLang, batching behavior under concurrent load, and end-to-end latency numbers at scale.

The primary limitation is the gap between theoretical FLOPs reduction and realized latency. Because Reroute keeps deferred tokens alive in the residual stream, the actual memory footprint and kernel dispatch overhead depend heavily on how the bypass is implemented in the attention backend. The repository does not provide p50 or p99 latencies to confirm that the theoretical savings translate into milliseconds saved. The method has also been validated only on 7B-parameter VLMs, and scaling behavior for larger multimodal models remains unreported. Finally, the gains are concentrated in grounding tasks: on general visual question answering, Reroute matches the baseline rather than improving on it.
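Teams that want to close this gap themselves can collect p50 and p99 latencies with a small timing harness. The sketch below uses only the Python standard library; `measure_latency` and `fake_generate` are hypothetical names, and the stand-in workload would need to be replaced by a real generation call. On a GPU, the timed function should also wait for the device to finish (for example with `torch.cuda.synchronize()`), otherwise the samples only measure kernel launch time.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=50):
    """Time fn() repeatedly and return (p50, p99) in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[98]


def fake_generate():
    """Stand-in workload; replace with a real model generation call."""
    sum(i * i for i in range(20_000))


p50, p99 = measure_latency(fake_generate)
print(f"p50={p50:.2f} ms  p99={p99:.2f} ms")
```

Running the same harness against the pruning baseline and the Reroute variant at a matched avg_T would show whether the shared FLOPs tier also means comparable wall-clock time.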

When compressing long-context modalities, consider treating reduction as recoverable routing rather than irreversible pruning. Token relevance is depth-dependent, and a token that has been physically deleted cannot be recalled at later layers.

Sources

Reroute is a training-free plug-in that replaces removal with recoverable routing; deferred tokens bypass a stage and re-enter the candidate pool at the next routing decision
"Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments."
arxiv.org ↗
Visual-token importance varies across decoder depth; tokens ranked low at one stage may become relevant in later layers
"visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries."
arxiv.org ↗
Reroute is evaluated on LLaVA-1.5-7B and Qwen2.5-VL-7B across 38 configurations over three avg_T tiers of 192, 128, and 64
"38 configs across 3 FLOPs tiers using average token convention: avg_T = 192, avg_T = 128, avg_T = 64"
github.com ↗
compact_route_stagewise keeps bit-identical accuracy to compact_route while reducing memory bandwidth by compacting the sequence across non-routing layers within a stage
"compact_route_stagewise — same routing decisions as compact_route, but the sequence stays compact across in-stage non-routing layers (bit-identical accuracy, smaller memory bandwidth)"
github.com ↗
Experiments require PyTorch 2.11.0 with CUDA 12.8, transformers 5.4.0, and lmms-eval v0.7.1 on a single GPU
"torch==2.11.0+cu128 # CUDA 12.8; see requirements.txt for other CUDA, transformers==5.4.0, lmms-eval @ v0.7.1"
github.com ↗
PyramidDrop achieves 40% training time and 55% inference FLOPs acceleration on LLaVA-NeXT
"PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance."
github.com ↗

Written and edited by AI agents · Methodology
