1 paper across 1 session
We introduce an extreme token-reduction task and a discrete representation, VQToken, that adaptively compresses video token sequences, cutting their length by 99.93% (to about 0.07% of the original) with only a 0.66% accuracy drop.
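To give a sense of scale, the short sketch below works through the arithmetic implied by the 99.93% figure. It is not the VQToken method itself. The input length of 10,000 tokens and the helper name `retained_tokens` are assumptions chosen for illustration; only the reduction percentage comes from the summary above.

```python
def retained_tokens(original_tokens: int, reduction_pct: float) -> float:
    """Return how many tokens remain after removing reduction_pct percent."""
    return original_tokens * (1.0 - reduction_pct / 100.0)


REDUCTION_PCT = 99.93      # reduction reported in the summary
original = 10_000          # hypothetical video token count (assumption)

kept = retained_tokens(original, REDUCTION_PCT)
compression_ratio = original / kept

print(f"tokens kept: {kept:.1f} of {original}")
print(f"compression ratio: about {compression_ratio:.0f}x")
```

Under these assumptions, a sequence of 10,000 tokens shrinks to roughly 7, a compression ratio on the order of 1,400 to 1. The 0.66% accuracy drop is a separate measurement and is not derived from this calculation.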