1 paper across 1 session
A lightweight approach that adaptively patches the input image to increase token information density and encode hierarchical spatial structures into the input embedding.