VECA (Visual Elastic Core Attention), published by researchers at Carnegie Mellon University, University of Hong Kong, and Columbia University, replaces all-to-all self-attention in vision transformers with linear-time routing through learned "core" tokens, cutting computational cost for high-resolution image processing without reducing the number of spatial tokens.

Standard vision transformers scale quadratically. Processing a 1024×1024 image requires 268 million pairwise attention interactions per layer. VECA instead routes all attention through C learned core tokens (typically 64 to 256) per layer: in two passes, patches attend to cores, then cores broadcast back to patches. Per-layer attention cost drops from O(N²) to O(N·C), growing linearly rather than quadratically with the number of patches N.
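A minimal PyTorch sketch of that two-pass routing, assuming standard multi-head attention blocks for both passes; the module name, shapes, and the `num_active_cores` argument are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn


class CoreAttentionSketch(nn.Module):
    """Two-pass core attention sketch: patches -> cores ("gather"), cores -> patches ("broadcast")."""

    def __init__(self, dim, num_cores=256, num_heads=8):
        super().__init__()
        # Learned core tokens shared across images; their ordering matters for nested dropout.
        self.cores = nn.Parameter(torch.randn(1, num_cores, dim) * 0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches, num_active_cores=None):
        # patches: (B, N, dim). All N patch embeddings survive the layer.
        cores = self.cores.expand(patches.size(0), -1, -1)
        if num_active_cores is not None:
            cores = cores[:, :num_active_cores]          # elastic: use only a prefix of the cores
        cores, _ = self.gather(cores, patches, patches)  # pass 1: cores attend to patches, O(N*C)
        out, _ = self.broadcast(patches, cores, cores)   # pass 2: patches attend to cores, O(N*C)
        return out                                       # (B, N, dim): no token compression


layer = CoreAttentionSketch(dim=256, num_cores=256)
x = torch.randn(1, 16384, 256)     # ~16384 patches for a 1024x1024 image, as in the example above
y = layer(x)
assert y.shape == x.shape          # the spatial token grid is preserved
```

Under the numbers above (N = 16384 patches, C = 256 cores), the two passes compute roughly 2·N·C ≈ 8.4 million attention scores per layer instead of N² ≈ 268 million, about a 32-fold reduction under this sketch's assumptions.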

FIG. 02 Complexity scaling: standard vision transformers require quadratic pairwise comparisons; VECA reduces to linear time by using elastic core tokens. — Source: VECA arXiv preprint (2605.12491)

Unlike prior linear-complexity approaches such as Perceiver and Set Transformers, VECA performs no compression. Those methods collapse the N patches into C latent tokens, discarding spatial detail. VECA keeps all N patch embeddings throughout the network; the cores mediate routing rather than replace the inputs, retaining the fine-grained structure needed for dense tasks such as segmentation and depth estimation.
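The difference is visible directly in tensor shapes. Continuing the hypothetical sketch above, the first line mimics a generic Perceiver-style bottleneck (not either paper's code), while the second is the full routed output:

```python
# Perceiver-style compression: only the C latents survive; the patch grid is gone.
latents, _ = layer.gather(layer.cores.expand(x.size(0), -1, -1), x, x)
print(latents.shape)   # torch.Size([1, 256, 256]) -> C tokens only

# VECA-style routing: cores only mediate; all N patch tokens are returned.
print(layer(x).shape)  # torch.Size([1, 16384, 256]) -> full patch grid for dense prediction
```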

Inference cost can be adjusted at runtime without retraining. The team applies nested dropout along the core axis during training, sampling a random subset of cores at each update. A model trained with 256 cores can then run inference with, say, 64 cores, trading accuracy for throughput without separate retraining or pruning cycles.
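A toy training step under that scheme, reusing the sketch module above; the budget list, objective, and optimizer settings are placeholders, not the paper's recipe:

```python
import random

budgets = [64, 128, 192, 256]                        # assumed prefix lengths to sample from
opt = torch.optim.AdamW(layer.parameters(), lr=1e-4)

for step in range(10):                               # toy loop on random data
    patches = torch.randn(2, 1024, 256)              # stand-in patch embeddings
    k = random.choice(budgets)                       # nested dropout: sample a core prefix length
    out = layer(patches, num_active_cores=k)         # only the first k cores route this step
    loss = out.pow(2).mean()                         # placeholder objective (the paper distills from DINOv3)
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Because every update trains some prefix of the ordered cores, each prefix length remains a usable sub-model at inference time.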

FIG. 03 Elastic inference: VECA models trained once can dynamically adjust core token count at runtime without retraining, trading speed for quality on the fly. — Source: VECA arXiv preprint (2605.12491)

Benchmarks show that VECA distilled from DINOv3 is competitive on classification and strong on dense prediction, closely matching DINOv3 on segmentation and depth estimation. Core attention patterns evolve across layers from isotropic blobs to semantic groupings without any explicit loss encouraging that structure.

For production teams deploying vision models on edge hardware, in medical imaging, or for satellite processing, VECA's linear scaling with resolution is a direct operational benefit. High-resolution inference with standard ViTs requires either high-memory accelerators or downsampling that degrades detail. VECA enables native 1024-pixel processing at a computational budget previously reserved for low-resolution inputs, with a single elasticity knob trading accuracy for throughput.
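In deployment terms, that knob reduces to a single call-time argument, shown here with the hypothetical sketch module from earlier; a real model would expose its own API:

```python
# Elastic inference sketch: one trained model, different core budgets per call.
with torch.no_grad():
    hi_res = torch.randn(1, 16384, 256)           # patch tokens for a 1024x1024 input
    fast = layer(hi_res, num_active_cores=64)     # lower cost, lower fidelity
    full = layer(hi_res, num_active_cores=256)    # full core set, highest fidelity
```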

The paper is a preprint posted to arXiv on May 12, 2026. Code is available in the project repository. Performance on long-range global context tasks—video understanding, satellite change detection—has not been tested. The core-periphery structure assumes learned hubs capture sufficient cross-patch information, which holds across tested benchmarks but has not been adversarially probed on out-of-distribution high-resolution inputs.
