Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#5107

Alias-Free ViT: Fractional Shift Invariance via Linear Attention

NeurIPS Project Page Slides Poster OpenReview

Abstract

Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive biases of convnets, which may limit their performance. In particular, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to small image translations than standard convnets.
Previous studies have shown, however, that convnets are not perfectly shift-invariant either, due to aliasing introduced by downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify the translation robustness of convnets.
Building on this line of work, we propose an Alias-Free ViT, which combines two main components.
  1. First, it uses alias-free downsampling and nonlinearities.
  2. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation.
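The second component can be illustrated with a minimal numpy sketch of channel-wise (cross-covariance) linear attention. This is an assumption-laden simplification, not the paper's implementation: the `d x d` attention map here is a global statistic over all tokens, so circularly shifting the token sequence shifts the output by the same amount, which is the integer-shift case of the equivariance claimed above.

```python
import numpy as np

def xca_linear(x, Wq, Wk, Wv):
    """Simplified cross-covariance linear attention (hypothetical sketch).

    x: (N, d) array of tokens. The attention map is d x d (over channels),
    computed by summing over tokens, so it is invariant to any spatial
    permutation of the tokens; the output is therefore shift-equivariant.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # L2-normalize each channel along the token axis (permutation-invariant)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    attn = k.T @ q      # (d, d) cross-covariance attention map
    return v @ attn     # (N, d) output, token-wise mixing of channels

rng = np.random.default_rng(0)
N, d = 16, 8
x = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = xca_linear(x, Wq, Wk, Wv)
out_shifted = xca_linear(np.roll(x, 3, axis=0), Wq, Wk, Wv)
# Shifting the input tokens shifts the output identically:
print(np.allclose(np.roll(out, 3, axis=0), out_shifted))  # True
```

Note this sketch omits the softmax of standard cross-covariance attention and any patch-embedding or anti-aliasing details; extending equivariance to fractional translations additionally requires the alias-free downsampling and nonlinearities of the first component.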
Our model maintains competitive accuracy in image classification and outperforms similarly sized models in robustness to adversarial translations.
Poster