1 paper across 1 session
This work utilizes vision foundation models to construct a visual tokenizer, which is trained in an end-to-end manner for AR image generation, achieving state-of-the-art results on the $256\times256$ class-to-image generation task on ImageNet.