4 papers across 2 sessions
We show that adapting vision foundation models via self-supervised fine-tuning on simple object-centric videos substantially improves representation quality without any labels.
STSBench is a benchmark that evaluates the ability of multi-modal large language models to reason about spatio-temporal actions.
We introduce CG-SSL, a concept-guided self-supervised learning framework that aligns meaningful image regions across views, achieving state-of-the-art performance on dense prediction tasks.