Research Manager, Facebook
3 papers at NeurIPS 2025
We generalize CLIP training to worldwide web scale, achieving +0.8% over the English-only counterpart on zero-shot ImageNet classification (no compromise) and SoTA zero-shot multilingual performance: 57.4% on CVQA and 50.2% on Babel-ImageNet.
We develop a CLIP model that is SoTA on both image and video zero-shot recognition. Building on its strong, general features, we further create SoTA encoders for language and spatial tasks.
We present Collaborative Reasoner (Coral), a framework to evaluate and improve the collaborative reasoning skills of language models.