2 papers across 2 sessions
We develop a CLIP model that is SotA on both image and video zero-shot recognition. Using its strong, general features, we further build SotA encoders for language and spatial tasks.
This paper introduces a comprehensive benchmark (COLORBENCH) and detailed analysis to systematically evaluate how well VLMs perceive, reason about, and robustly handle colors.