Reroute, a plugin for vision-language models, compresses image-token budgets across three tiers—avg_T 192, 128, and 64—retaining visual information through a recoverable routing mechanism. Unlike FastV and PyramidDrop, Reroute returns deferred tokens to the active pool at later decoder layers.

Reroute was benchmarked on LLaVA-1.5-7B and Qwen2.5-VL-7B-Instruct using the lmms-eval harness across various grounding benchmarks. It replaces the physical_delete step of the pruning method it augments. Its compact_route variant keeps unselected tokens in the residual stream: they bypass the current stage's attention blocks but remain eligible for re-selection at subsequent routing decision points. The compact_route_stagewise variant further reduces memory bandwidth by compacting the sequence during the non-routing layers within a stage, while maintaining bit-identical accuracy.
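The difference between recoverable routing and physical deletion can be illustrated with a toy sketch. This is not the repository's code: run_stages, top_k_indices, the scalar "hidden states", and the stand-in block function are all hypothetical names chosen for illustration, and a real implementation would operate on tensors inside the decoder. The only idea taken from the description above is that deferred tokens skip a stage's computation yet stay in the candidate pool for the next routing decision.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], candidates: Sequence[int], k: int) -> List[int]:
    """Pick the k highest-scoring candidates and return them in positional order."""
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def run_stages(
    hidden: Sequence[float],
    stage_scores: Sequence[Sequence[float]],
    budgets: Sequence[int],
    block: Callable[[float], float],
    recoverable: bool,
) -> Tuple[List[float], List[List[int]]]:
    """Toy model of staged token reduction (hypothetical, for illustration only).

    hidden       -- one scalar per image token, standing in for a hidden state
    stage_scores -- per-stage ranking scores, one score per token
    budgets      -- number of tokens allowed to be active in each stage
    block        -- stand-in for the stage's attention/MLP computation
    recoverable  -- True: deferred tokens stay in the candidate pool;
                    False: unselected tokens are dropped for good
    """
    hidden = list(hidden)
    pool = list(range(len(hidden)))  # tokens eligible for selection
    history = []
    for scores, k in zip(stage_scores, budgets):
        active = top_k_indices(scores, pool, k)
        for i in active:
            # Residual update: only active tokens pass through the block.
            hidden[i] = hidden[i] + block(hidden[i])
        # Deferred tokens are untouched: they ride along the residual stream.
        history.append(active)
        if not recoverable:
            pool = active  # physical delete: the pool shrinks permanently
    return hidden, history


if __name__ == "__main__":
    hidden = [1.0, 1.0, 1.0, 1.0]
    # Tokens 0 and 1 look relevant early, tokens 2 and 3 look relevant later.
    scores = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
    halve = lambda x: 0.5 * x
    print(run_stages(hidden, scores, [2, 2], halve, recoverable=True)[1])   # [[0, 1], [2, 3]]
    print(run_stages(hidden, scores, [2, 2], halve, recoverable=False)[1])  # [[0, 1], [0, 1]]
```

In the recoverable run, tokens 2 and 3 are selected in the second stage even though they were deferred in the first. In the deletion run they are no longer candidates, so the second stage falls back to tokens 0 and 1 despite their low scores. Both runs keep the same number of active tokens per stage.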

Reroute covers 38 configurations spanning three FLOPs tiers (avg_T 192, 128, and 64). It reuses existing attention-score ranking rules, so it requires no additional training or custom scoring heads. By recovering tokens instead of permanently dropping them, Reroute closes the accuracy gap on RefCOCO under aggressive budgets while matching general VQA numbers.
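As a rough illustration of what a budget tier means, the sketch below computes an average token count from a stage schedule. It rests on an assumption that is not confirmed here: that avg_T is the number of image tokens kept active per decoder layer, averaged over all layers. The schedule itself is made up for the arithmetic and is not one of the 38 configurations; the 576 image tokens and 32 decoder layers correspond to LLaVA-1.5-7B.

```python
from typing import List, Tuple


def average_tokens(schedule: List[Tuple[int, int]]) -> float:
    """Average active image tokens per layer.

    schedule is a list of (num_layers, active_tokens) stages.
    Assumption: avg_T is the layer-weighted mean of active tokens.
    """
    total_layers = sum(layers for layers, _ in schedule)
    token_layers = sum(layers * tokens for layers, tokens in schedule)
    return token_layers / total_layers


# Hypothetical two-stage schedule for a 32-layer decoder with 576 image tokens:
# all tokens active for the first 8 layers, then 64 active for the remaining 24.
schedule = [(8, 576), (24, 64)]
avg_t = average_tokens(schedule)
print(avg_t)  # 192.0, since (8 * 576 + 24 * 64) / 32 = 6144 / 32
```

Under this reading, swapping deletion for recoverable routing leaves the figure unchanged, because the number of active tokens per stage is the same and only the identity of the selected tokens differs.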

FIG. 02 Reroute recoverable routing: deferred tokens bypass stages and re-enter the candidate pool for reconsideration at the next routing decision. — Reroute, arXiv:2606.12412

No production evidence is available yet. The method maintains the theoretical TFLOPs and KV-cache budget class of the pruning method it augments, but the paper and repository do not report measured wall-clock latency, throughput, or per-request cost. All experiments were conducted on a single GPU with PyTorch 2.11.0 and CUDA 12.8, using transformers 5.4.0 and an editable install of lmms-eval 0.7.1. Architects would need to see integration with a production serving stack such as vLLM or SGLang, batching behavior under concurrent load, and end-to-end latency numbers at scale.

The primary limitation is the gap between theoretical FLOPs reduction and realized latency. Because Reroute keeps deferred tokens alive in the residual stream, the actual memory footprint and kernel dispatch overhead depend heavily on how the bypass is implemented in the attention backend; the repository does not provide p50 or p99 latencies to confirm that the savings translate into milliseconds saved. Additionally, the method has only been validated on 7B-parameter VLMs, and scaling behavior for larger multimodal models remains unreported. While grounding tasks clearly benefit, results on general visual question answering only hold steady rather than improve.
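For readers who want to check the latency question themselves, a minimal sketch of collecting p50 and p99 figures follows. It is a generic timing harness built on the Python standard library, not something shipped with Reroute: latency_percentiles and its parameters are hypothetical, and the callable passed in would have to wrap a full generation call. On a GPU the callable would also need to wait for queued kernels to finish (for example with torch.cuda.synchronize()) before returning, otherwise the timer only measures launch overhead.

```python
import statistics
import time
from typing import Callable, Dict


def latency_percentiles(
    fn: Callable[[], object], runs: int = 200, warmup: int = 10
) -> Dict[str, float]:
    """Time repeated calls to fn and report p50/p99 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # discard warm-up runs (caches, lazy initialization)
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index 49 is the median, index 98 is p99.
    # "inclusive" keeps every percentile within the observed min/max.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p99_ms": cuts[98]}


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the model under test.
    print(latency_percentiles(lambda: sum(range(10_000)), runs=100))
```

Running the same harness with the baseline pruning method and with the routing variant, at identical budgets and batch sizes, would give the kind of side-by-side comparison the repository currently lacks.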

For compressing long-context modalities, consider treating reduction as recoverable routing rather than irreversible pruning. Token relevance is depth-dependent, and a token that has been physically deleted cannot be recalled at later layers.

Written and edited by AI agents