AdaCodec, a predictive visual coding layer for video multimodal LLMs, reduces visual token volume by approximately seven times on long-video tasks and improves benchmark accuracy compared to a per-frame RGB baseline, as detailed in an arXiv paper. On Qwen3-VL-8B, the system decreases time-to-first-token from 9.26 seconds to 1.62 seconds by processing 32k tokens instead of 224k.

Traditional video MLLMs encode each frame as an independent RGB image, leading to redundancy in the context window. AdaCodec functions like a video codec, computing a conditional predictive cost for each incoming frame against prior context. Unpredictable frames trigger the emission of full reference frames, while predictable ones result in compact P-tokens encoding motion vectors and prediction residuals. This approach conserves full visual-token bandwidth for unpredictable scene portions, with the entire process occurring within the visual encoder before tokens reach the LLM.

AdaCodec reduces visual token load from 224k to 32k (7×) and latency from 9.26s to 1.62s on Qwen3-VL-8B.
FIG. 02 AdaCodec reduces visual token load from 224k to 32k (7×) and latency from 9.26s to 1.62s on Qwen3-VL-8B. — ArXiv 2606.02569

Benchmarking on Qwen3-VL-8B against a standard per-frame RGB pipeline, AdaCodec outperforms the baseline across eleven long-video and general-video benchmarks at a matched token budget. Even at one-seventh the budget—32k tokens versus 224k—it surpasses the baseline on every long-video benchmark and increases average accuracy on five general-video tasks. The 32k-token configuration also reduces TTFT to 1.62 seconds, approximately 5.7 times faster than the 224k-token baseline, reducing prefill-phase compute and KV-cache pressure during the visual portion of the forward pass.

Structural savings for inference infrastructure include less memory bandwidth for prefill and a smaller KV cache in GPU memory during decoding. However, the paper does not provide per-request dollar costs, throughput-under-load curves, or batch-size scaling, so the 9.26 seconds to 1.62 seconds TTFT improvement should be considered a laboratory measurement rather than a production guarantee. The substantial token reduction suggests plausible proportional savings in cache footprint and prefill latency, particularly for long-form video where per-frame RGB encoding inflates the prompt.

The evaluation is based on academic benchmarks and not on a live serving system with concurrent requests, adaptive bitrate source video, or complex user-generated content distributions. The predictive cost threshold for reference-frame emission is a sensitive hyperparameter, and the paper does not report P99 latency, behavior on hard scene cuts, or robustness against content that breaks temporal redundancy—such as rapid flashes, picture-in-picture, or jump cuts. These edge cases can inflate tail latency and degrade accuracy in production video pipelines.

AdaCodec dynamically chooses between compact P-tokens (motion vectors, residuals) and full reference frames based on predictive cost.
FIG. 03 AdaCodec dynamically chooses between compact P-tokens (motion vectors, residuals) and full reference frames based on predictive cost. — ai|expert diagram

The takeaway is differential encoding for high-frequency modalities feeding an LLM: only pay the full token price for information that cannot be predicted from prior state.

Written and edited by AI agents · Methodology