VECA (Visual Elastic Core Attention), published by researchers at Carnegie Mellon University, University of Hong Kong, and Columbia University, replaces all-to-all self-attention in vision transformers with linear-time routing through learned "core" tokens, cutting computational cost for high-resolution image processing without reducing the number of spatial tokens.

Standard vision transformers scale quadratically. Processing a 1024×1024 image requires 268 million pairwise attention interactions per layer. VECA instead routes all attention through C learned core tokens (typically 64 to 256) per layer: in two passes, patches attend to cores, then cores broadcast back to patches. Per-layer attention cost drops from O(N²) to O(N·C), growing linearly rather than quadratically with the number of patches N.
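A minimal PyTorch sketch of that two-pass routing, assuming standard multi-head attention blocks for both passes; the module name, shapes, and the `num_active_cores` argument are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn


class CoreAttentionSketch(nn.Module):
    """Two-pass core attention sketch: patches -> cores ("gather"), cores -> patches ("broadcast")."""

    def __init__(self, dim, num_cores=256, num_heads=8):
        super().__init__()
        # Learned core tokens shared across images; their ordering matters for nested dropout.
        self.cores = nn.Parameter(torch.randn(1, num_cores, dim) * 0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches, num_active_cores=None):
        # patches: (B, N, dim). All N patch embeddings survive the layer.
        cores = self.cores.expand(patches.size(0), -1, -1)
        if num_active_cores is not None:
            cores = cores[:, :num_active_cores]          # elastic: use only a prefix of the cores
        cores, _ = self.gather(cores, patches, patches)  # pass 1: cores attend to patches, O(N*C)
        out, _ = self.broadcast(patches, cores, cores)   # pass 2: patches attend to cores, O(N*C)
        return out                                       # (B, N, dim): no token compression


layer = CoreAttentionSketch(dim=256, num_cores=256)
x = torch.randn(1, 16384, 256)     # ~16384 patches for a 1024x1024 image, as in the example above
y = layer(x)
assert y.shape == x.shape          # the spatial token grid is preserved
```

Under the numbers above (N = 16384 patches, C = 256 cores), the two passes compute roughly 2·N·C ≈ 8.4 million attention scores per layer instead of N² ≈ 268 million, about a 32-fold reduction under this sketch's assumptions.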

FIG. 02 Complexity scaling: standard vision transformers require quadratic pairwise comparisons; VECA reduces to linear time by using elastic core tokens. — Source: VECA arXiv preprint (2605.12491)

Unlike prior linear-complexity approaches such as Perceiver and Set Transformers, VECA performs no compression. Those methods collapse the N patches into C latent tokens, discarding spatial detail. VECA keeps all N patch embeddings throughout the network; the cores mediate routing rather than replace the inputs, retaining the fine-grained structure needed for dense tasks such as segmentation and depth estimation.
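The difference is visible directly in tensor shapes. Continuing the hypothetical sketch above, the first line mimics a generic Perceiver-style bottleneck (not either paper's code), while the second is the full routed output:

```python
# Perceiver-style compression: only the C latents survive; the patch grid is gone.
latents, _ = layer.gather(layer.cores.expand(x.size(0), -1, -1), x, x)
print(latents.shape)   # torch.Size([1, 256, 256]) -> C tokens only

# VECA-style routing: cores only mediate; all N patch tokens are returned.
print(layer(x).shape)  # torch.Size([1, 16384, 256]) -> full patch grid for dense prediction
```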

Inference cost can be adjusted at runtime without retraining. The team applies nested dropout along the core axis during training, sampling a random subset of cores at each update. A model trained with 256 cores can then run inference with, say, 64 cores, trading accuracy for throughput without separate retraining or pruning cycles.
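A toy training step under that scheme, reusing the sketch module above; the budget list, objective, and optimizer settings are placeholders, not the paper's recipe:

```python
import random

budgets = [64, 128, 192, 256]                        # assumed prefix lengths to sample from
opt = torch.optim.AdamW(layer.parameters(), lr=1e-4)

for step in range(10):                               # toy loop on random data
    patches = torch.randn(2, 1024, 256)              # stand-in patch embeddings
    k = random.choice(budgets)                       # nested dropout: sample a core prefix length
    out = layer(patches, num_active_cores=k)         # only the first k cores route this step
    loss = out.pow(2).mean()                         # placeholder objective (the paper distills from DINOv3)
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Because every update trains some prefix of the ordered cores, each prefix length remains a usable sub-model at inference time.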

FIG. 03 Elastic inference: VECA models trained once can dynamically adjust core token count at runtime without retraining, trading speed for quality on the fly. — Source: VECA arXiv preprint (2605.12491)

Benchmarks show that VECA distilled from DINOv3 is competitive on classification and strong on dense prediction, closely matching DINOv3 on segmentation and depth estimation. Core attention patterns evolve across layers from isotropic blobs to semantic groupings without any explicit loss encouraging that structure.

For production teams deploying vision models on edge hardware, in medical imaging, or for satellite processing, VECA's linear scaling with resolution is a direct operational benefit. High-resolution inference with standard ViTs requires either high-memory accelerators or downsampling that degrades detail. VECA enables native 1024-pixel processing at a computational budget previously reserved for low-resolution inputs, with a single elasticity knob trading accuracy for throughput.
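In deployment terms, that knob reduces to a single call-time argument, shown here with the hypothetical sketch module from earlier; a real model would expose its own API:

```python
# Elastic inference sketch: one trained model, different core budgets per call.
with torch.no_grad():
    hi_res = torch.randn(1, 16384, 256)           # patch tokens for a 1024x1024 input
    fast = layer(hi_res, num_active_cores=64)     # lower cost, lower fidelity
    full = layer(hi_res, num_active_cores=256)    # full core set, highest fidelity
```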

The paper is a preprint posted to arXiv on May 12, 2026. Code is available in the project repository. Performance on long-range global context tasks—video understanding, satellite change detection—has not been tested. The core-periphery structure assumes learned hubs capture sufficient cross-patch information, which holds across tested benchmarks but has not been adversarially probed on out-of-distribution high-resolution inputs.
