Assistant Professor, Télécom Paris
1 paper at NeurIPS 2025
OSKAR is a self-supervised multimodal model that predicts masked token latent features from video, skeleton, and text using momentum target encoders—outperforming specialized models across tasks without reconstruction or contrastive losses.