3 papers across 2 sessions
An effective parameter-efficient model merging method for multimodal large language models, approached from the perspective of direction robustness in low-rank space.
We show that adapting vision foundation models through self-supervised fine-tuning on simple object-centric videos substantially improves representation quality without requiring labels.