We propose ETT, an end-to-end tokenizer tuning approach that jointly optimizes vision tokenization and the target autoregressive tasks.