Poster Session 3 · Thursday, December 4, 2025 11:00 AM → 2:00 PM
#3519
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Abstract
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi–Head Self–Attention (MHSA) layer still performs a quadratic query–key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations.
We introduce Visual–Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from quadratic to linear in the number of tokens. VCA first distils each head’s dense query field into a handful of spatially pooled visual–contrast tokens, then splits them into a learnable positive stream and a learnable negative stream whose differential interaction highlights what truly separates one region from another.
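The mechanism above can be sketched in a few lines. This is a simplified illustration, not the authors' exact formulation: the pooling scheme (uniform average over token groups), the channel split into positive/negative streams, and the mixing weight `lam` are all assumptions made for the sketch; the key point it demonstrates is that attending to `n << N` pooled contrast tokens makes the interaction linear in `N`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_contrast_attention(X, Wq, Wk, Wv, pool=4, lam=0.5):
    """Illustrative single-head VCA sketch (hypothetical formulation).

    X: (N, d) token features. Keys/values are spatially pooled into
    n = N // pool contrast tokens, then a positive and a negative
    attention map are combined differentially.
    """
    N, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Spatial pooling: average each group of `pool` tokens into one
    # visual-contrast token (n << N of them).
    n = N // pool
    Kp = K[: n * pool].reshape(n, pool, d).mean(axis=1)
    Vp = V[: n * pool].reshape(n, pool, d).mean(axis=1)
    # Positive / negative streams: here simply the two channel halves
    # (an assumption; the paper learns the two streams).
    h = d // 2
    A_pos = softmax(Q[:, :h] @ Kp[:, :h].T / np.sqrt(h))
    A_neg = softmax(Q[:, h:] @ Kp[:, h:].T / np.sqrt(h))
    # Differential interaction: contrast the two maps.
    A = A_pos - lam * A_neg
    return A @ Vp  # cost O(N * n) rather than O(N^2)
```

Because every query interacts with only `n` pooled tokens, the score matrix is `(N, n)` instead of `(N, N)`, which is where the linear complexity in the title comes from.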
The module adds only a marginal number of parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K and improves three strong hierarchical ViTs, while in class-conditional ImageNet generation it lowers FID-50K across both diffusion (DiT) and flow (SiT) models.
Extensive ablations confirm that
- spatial pooling supplies low-variance global cues,
- dual positional embeddings are indispensable for contrastive reasoning, and
- combining the two in both stages yields the strongest synergy.
VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.