1 paper across 1 session
An end-to-end learnable tokenizer for Vision Transformers that enhances spatial and semantic learning by allowing retrofitting of pretrained models to use pixel-level tokens