Associate Professor, EPFL
3 papers at NeurIPS 2025
OSKAR is a self-supervised multimodal model that predicts masked token latent features from video, skeleton, and text using momentum target encoders—outperforming specialized models across tasks without reconstruction or contrastive losses.
Based on a newly discovered "free lunch" in voxel labels, VoxDet reformulates 3D semantic scene completion as dense object detection using a VoxNT trick.