2 papers across 1 session
We introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs.
We develop a CLIP model that achieves state-of-the-art (SotA) zero-shot recognition on both images and videos. Using its strong, general features, we further build SotA encoders for language and spatial tasks.