Vision Transformer cuts attention cost to 2NC + C² with elastic core tokens
Researchers propose a block-sparse attention structure for Vision Transformers that scales linearly with core tokens instead of quadratically with all tokens, matching DINOv2 accuracy across 256–1024 resolutions.
A new Vision Transformer architecture replaces the standard N² dense self-attention with a core-periphery block-sparse design that scales as 2NC + C² for C core tokens. Released May 13 on arXiv, the approach uses nested dropout during training to enable test-time elastic adjustments to inference cost—practitioners can dial the number of core tokens up or down depending on available compute without retraining.
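The 2NC + C² figure is consistent with a block-sparse layout in which core tokens attend to every token while periphery tokens attend only to core tokens. The sketch below builds such a mask in PyTorch as an illustration; the function names, and the assumption that core tokens sit at the front of the sequence, are ours rather than anything from the paper's in-progress code.

```python
import torch

def core_periphery_mask(num_core: int, num_periphery: int) -> torch.Tensor:
    """Boolean attention mask for an assumed core-periphery layout (True = allowed).

    Core tokens occupy the first num_core positions. Core rows attend to every
    token; periphery rows attend only to core columns. The allowed pairs count
    C*C (core-core) plus 2*C*num_periphery (core<->periphery), i.e. 2NC + C^2
    with N read as the periphery token count.
    """
    total = num_core + num_periphery
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:num_core, :] = True   # core tokens attend everywhere
    mask[:, :num_core] = True   # every token attends to the core tokens
    return mask

# Example: 4096 tokens from a 1024x1024 image with 16-pixel patches, 256 of them core.
mask = core_periphery_mask(num_core=256, num_periphery=4096 - 256)
print(int(mask.sum()), 4096 ** 2)   # ~2.03M allowed pairs vs ~16.8M under dense attention
```

Such a mask could be fed to a standard attention call (for example as the attn_mask argument of PyTorch's scaled_dot_product_attention), though a real implementation would presumably use a dedicated block-sparse kernel rather than materializing a dense mask.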
The model matches DINOv2's dense feature quality and classification accuracy across resolutions from 256 to 1024 pixels. That range matters for practitioners running vision models on everything from mobile devices to high-resolution document analysis. Traditional Vision Transformers hit a quadratic wall at higher resolutions: a 1024×1024 image yields 16× as many tokens as a 256×256 image, so dense self-attention costs roughly 256× as much. The core-periphery structure keeps that cost linear in the number of core tokens, which can be dialed up or down elastically at inference time.
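A quick back-of-the-envelope check makes that scaling concrete, assuming a 16-pixel patch size and an illustrative budget of 256 core tokens (both assumptions, not figures quoted from the paper):

```python
patch = 16                                      # assumed patch size
tokens = lambda res: (res // patch) ** 2        # tokens grow with image area

n_small, n_large = tokens(256), tokens(1024)    # 256 and 4096 tokens
print(n_large // n_small)                       # 16x more tokens at 1024x1024
print(n_large**2 // n_small**2)                 # 256x the dense N^2 attention cost

c = 256                                         # illustrative core-token budget
print(n_large**2, 2 * n_large * c + c * c)      # ~16.8M dense pairs vs ~2.16M under 2NC + C^2
```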
Early-layer attention maps start out isotropic, with roughly uniform patterns that treat all spatial regions equally. Deeper in the network, those patterns become semantically aligned without explicit supervision. When the core token count drops, attention patterns spread over larger spatial regions; when it rises, patterns concentrate into tighter clusters. The nested dropout training strategy enables this elastic property by randomly masking different numbers of core tokens during training, so the model learns to function across a range of computational budgets. At test time, a practitioner chooses how many core tokens to use based on latency requirements or available GPU memory. Code is in progress on GitHub; the arXiv preprint includes dense feature visualizations and comparisons to DINOv2 across the full resolution range.
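Here is a minimal sketch of what nested-dropout-style elastic training could look like, assuming the core tokens are kept in an importance order and a random prefix of them is retained each training step; the sampling distribution, tensor shapes, and latency heuristic below are illustrative guesses, not the paper's implementation.

```python
import torch

def sample_core_budget(max_core: int) -> int:
    """Nested-dropout-style sampling: draw how many ranked core tokens to keep."""
    return int(torch.randint(1, max_core + 1, (1,)))

def truncate_core_tokens(core_tokens: torch.Tensor, k: int) -> torch.Tensor:
    """core_tokens: (batch, max_core, dim), assumed ordered most-to-least important.

    Keeping only the first k tokens during training forces every prefix to be a
    usable representation, so the same weights serve any test-time budget.
    """
    return core_tokens[:, :k, :]

# Training step: sample a random budget so the model sees many budgets.
core = torch.randn(8, 256, 768)                        # hypothetical shapes
core_train = truncate_core_tokens(core, sample_core_budget(core.shape[1]))

# Inference: pick the budget from the deployment constraint instead of sampling.
def pick_core_budget(latency_budget_ms: float) -> int:
    return 64 if latency_budget_ms < 10 else 256       # illustrative thresholds only

core_infer = truncate_core_tokens(core, pick_core_budget(latency_budget_ms=8.0))
```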
