AdaCodec cuts video-token load by 7× with predictive encoding

AdaCodec, a predictive visual coding layer for video multimodal LLMs, cuts visual-token volume roughly sevenfold on long-video tasks while improving benchmark accuracy over a per-frame RGB baseline, according to an arXiv paper (2606.02569). On Qwen3-VL-8B, it lowers time-to-first-token (TTFT) from 9.26 seconds to 1.62 seconds by feeding the model 32k tokens instead of 224k.

Traditional video MLLMs encode each frame as an independent RGB image, which fills the context window with redundant tokens. AdaCodec works like a video codec: for each incoming frame it computes a conditional predictive cost against prior context. When that cost is high, the frame is emitted as a full reference frame. When the frame is predictable, AdaCodec instead emits compact P-tokens that encode inter-frame changes, including motion and prediction residuals. Full visual-token bandwidth is thus reserved for the unpredictable parts of a scene, and the whole process happens inside the visual encoder before tokens reach the LLM.
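The gating decision described above can be sketched in a few lines. This is a toy illustration, not the paper's method: the cost function (mean absolute difference against the last reference frame), the token costs `REF_TOKENS` and `P_TOKENS`, and the function names are all assumptions chosen for readability.

```python
from typing import List, Optional, Sequence

Frame = Sequence[float]  # flattened pixel intensities; a toy stand-in for a real frame

REF_TOKENS = 256  # hypothetical token cost of one full reference frame
P_TOKENS = 16     # hypothetical token cost of one frame's compact P-tokens


def predictive_cost(frame: Frame, reference: Frame) -> float:
    """Mean absolute difference from the last reference frame.

    Assumption: a simple stand-in for the conditional predictive cost,
    whose exact form is not given in this article.
    """
    return sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)


def plan_tokens(frames: List[Frame], threshold: float) -> List[str]:
    """Label each frame 'REF' (full frame) or 'P' (compact P-tokens)."""
    decisions: List[str] = []
    reference: Optional[Frame] = None
    for frame in frames:
        if reference is None or predictive_cost(frame, reference) > threshold:
            decisions.append("REF")  # unpredictable: spend full tokens
            reference = frame        # later frames are predicted from this one
        else:
            decisions.append("P")    # predictable: spend only a few tokens
    return decisions


def token_count(decisions: List[str]) -> int:
    """Total visual tokens implied by a list of REF/P decisions."""
    return sum(REF_TOKENS if d == "REF" else P_TOKENS for d in decisions)


# Five tiny frames: slow drift, then an abrupt change at the fourth frame.
frames = [[0.0] * 4, [0.05] * 4, [0.1] * 4, [0.9] * 4, [0.92] * 4]
decisions = plan_tokens(frames, threshold=0.2)
print(decisions)
print(token_count(decisions), "tokens vs", len(frames) * REF_TOKENS, "for per-frame encoding")
```

In this toy run only the first frame and the abrupt change pay the full token price. A real system would compare against richer prior context than a single reference frame.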

FIG. 02 AdaCodec reduces visual token load from 224k to 32k (7×) and latency from 9.26s to 1.62s on Qwen3-VL-8B. — ArXiv 2606.02569

Benchmarked on Qwen3-VL-8B against a standard per-frame RGB pipeline, AdaCodec beats the baseline on all eleven long-video and general-video benchmarks at a matched token budget. Even at one-seventh the budget (32k tokens versus 224k), it surpasses the baseline on every long-video benchmark and raises average accuracy across five general-video tasks. The 32k-token configuration also cuts time-to-first-token (TTFT) to 1.62 seconds, roughly 5.7 times faster than the 224k-token baseline's 9.26 seconds. Fewer visual tokens mean less prefill compute and less KV-cache pressure from the visual portion of the prompt.
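The headline ratios follow directly from the reported figures. The short check below uses only the numbers quoted above; treating "k" as 1,000 is an assumption, though it does not affect the ratio.

```python
# Figures reported for Qwen3-VL-8B (k taken as 1,000; an assumption).
baseline_tokens = 224_000
adacodec_tokens = 32_000
baseline_ttft_s = 9.26
adacodec_ttft_s = 1.62

token_ratio = baseline_tokens / adacodec_tokens         # how many times fewer tokens
ttft_speedup = baseline_ttft_s / adacodec_ttft_s        # how many times faster to first token
ttft_reduction = 1 - adacodec_ttft_s / baseline_ttft_s  # fraction of TTFT removed

print(f"token reduction: {token_ratio:.1f}x")
print(f"TTFT speedup:    {ttft_speedup:.2f}x")
print(f"TTFT reduction:  {ttft_reduction:.1%}")
```

The speedup (about 5.7×) is smaller than the token reduction (7×), so latency does not shrink one-for-one with token count in this measurement. The figures quoted here do not say where the remaining time goes.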

For inference infrastructure, the structural savings are lower memory bandwidth during prefill and a smaller KV cache held in GPU memory during decoding. However, the paper does not report per-request dollar costs, throughput-under-load curves, or batch-size scaling, so the TTFT drop from 9.26 seconds to 1.62 seconds should be read as a laboratory measurement rather than a production guarantee. The sevenfold token reduction makes roughly proportional savings in cache footprint plausible, along with lower prefill latency, particularly for long-form video where per-frame RGB encoding inflates the prompt.
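To see why the cache saving should be roughly proportional, here is a back-of-the-envelope KV-cache estimate. The model dimensions below (36 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are illustrative assumptions for an 8B-class model with grouped-query attention, not figures from the paper. Check them against the actual model config before relying on the absolute numbers.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    The leading factor of 2 counts both the key and the value tensors.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed dimensions (hypothetical; not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
baseline = kv_cache_bytes(224_000, LAYERS, KV_HEADS, HEAD_DIM)
adacodec = kv_cache_bytes(32_000, LAYERS, KV_HEADS, HEAD_DIM)

GIB = 1024 ** 3
print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"224k tokens: {baseline / GIB:.2f} GiB")
print(f"32k tokens:  {adacodec / GIB:.2f} GiB")
print(f"ratio:       {baseline / adacodec:.1f}x")
```

Because the cache grows linearly with token count, the ratio stays at 7× whatever dimensions are plugged in. Only the absolute sizes depend on the assumed config.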

The evaluation uses academic benchmarks, not a live serving system with concurrent requests, adaptive-bitrate source video, or the complex content distributions of user-generated video. The predictive-cost threshold for reference-frame emission is a sensitive hyperparameter, and the paper does not report P99 latency, behavior on hard scene cuts, or robustness to content that breaks temporal redundancy, such as rapid flashes, picture-in-picture, or jump cuts. These edge cases can inflate tail latency and degrade accuracy in production video pipelines.
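The threshold sensitivity and the flashing-content failure mode can be illustrated with a toy model. Everything here is an assumption made for illustration: each frame is reduced to a single brightness value, the cost is the absolute difference from the last reference frame, and `reference_fraction` is a hypothetical helper rather than anything from the paper.

```python
from typing import Optional, Sequence


def reference_fraction(brightness: Sequence[float], threshold: float) -> float:
    """Fraction of frames emitted as full reference frames.

    Toy model: a frame becomes a reference when its brightness differs from
    the last reference by more than `threshold`.
    """
    refs = 0
    reference: Optional[float] = None
    for value in brightness:
        if reference is None or abs(value - reference) > threshold:
            refs += 1
            reference = value
    return refs / len(brightness)


static = [0.5] * 8                    # nothing changes after the first frame
drift = [i * 0.1 for i in range(8)]   # slow, steady change
flashing = [0.0, 1.0] * 4             # every frame differs sharply from the last

for name, clip in [("static", static), ("drift", drift), ("flashing", flashing)]:
    for threshold in (0.05, 0.25):
        frac = reference_fraction(clip, threshold)
        print(f"{name:9s} threshold={threshold:.2f} -> {frac:.3f} of frames are references")
```

In this toy setup the flashing clip falls back to full per-frame encoding at both thresholds, while the drifting clip swings from every frame to about a third of frames depending on the threshold. That is the kind of content- and threshold-dependence a production evaluation would need to measure.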

FIG. 03 AdaCodec dynamically chooses between compact P-tokens (motion vectors, residuals) and full reference frames based on predictive cost. — ai|expert diagram

The takeaway is differential encoding for high-frequency modalities feeding an LLM: only pay the full token price for information that cannot be predicted from prior state.

Sources

AdaCodec reduces time-to-first-token from 9.26s to 1.62s on Qwen3-VL-8B by feeding the model 32k tokens instead of 224k
"cutting time-to-first-token from 9.26s to 1.62s"
arxiv.org ↗
At 1/7 the token budget (32k vs 224k), AdaCodec surpasses the per-frame RGB baseline on all long-video benchmarks
"Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks"
arxiv.org ↗
AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline across all eleven benchmarks at a matched visual-token budget
"Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget"
arxiv.org ↗
AdaCodec emits compact P-tokens encoding motion vectors and prediction residuals when a frame can be predicted from prior context
"it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens"
arxiv.org ↗
AdaCodec emits a full reference frame only when its conditional predictive cost against prior context is high
"AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high"
arxiv.org ↗

Written and edited by AI agents · Methodology
