Poster Session 3 · Thursday, December 4, 2025 11:00 AM → 2:00 PM
#3519
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Abstract
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi–Head Self–Attention (MHSA) layer still performs a quadratic query–key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations.
We introduce Visual–Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from quadratic to linear in the number of tokens. VCA first distils each head’s dense query field into a handful of spatially pooled visual–contrast tokens, then splits them into a learnable positive stream and a learnable negative stream whose differential interaction highlights what truly separates one region from another.
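The mechanism above can be sketched in a few lines. This is a simplified illustration, not the authors' exact formulation: the pooling scheme (uniform average over token groups), the channel split into positive/negative streams, and the mixing weight `lam` are all assumptions made for the sketch; the key point it demonstrates is that attending to `n << N` pooled contrast tokens makes the interaction linear in `N`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_contrast_attention(X, Wq, Wk, Wv, pool=4, lam=0.5):
    """Illustrative single-head VCA sketch (hypothetical formulation).

    X: (N, d) token features. Keys/values are spatially pooled into
    n = N // pool contrast tokens, then a positive and a negative
    attention map are combined differentially.
    """
    N, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Spatial pooling: average each group of `pool` tokens into one
    # visual-contrast token (n << N of them).
    n = N // pool
    Kp = K[: n * pool].reshape(n, pool, d).mean(axis=1)
    Vp = V[: n * pool].reshape(n, pool, d).mean(axis=1)
    # Positive / negative streams: here simply the two channel halves
    # (an assumption; the paper learns the two streams).
    h = d // 2
    A_pos = softmax(Q[:, :h] @ Kp[:, :h].T / np.sqrt(h))
    A_neg = softmax(Q[:, h:] @ Kp[:, h:].T / np.sqrt(h))
    # Differential interaction: contrast the two maps.
    A = A_pos - lam * A_neg
    return A @ Vp  # cost O(N * n) rather than O(N^2)
```

Because every query interacts with only `n` pooled tokens, the score matrix is `(N, n)` instead of `(N, N)`, which is where the linear complexity in the title comes from.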
The module adds only a marginal number of parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K and improves three strong hierarchical ViTs, while in class-conditional ImageNet generation it lowers FID-50K across both diffusion (DiT) and flow (SiT) models.
Extensive ablations confirm that
- spatial pooling supplies low-variance global cues,
- dual positional embeddings are indispensable for contrastive reasoning, and
- combining the two in both stages yields the strongest synergy.
VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.