3 papers across 2 sessions
We release QV-M², the first fully human-annotated multi-moment video benchmark, and present FlashMMR, which outperforms prior SOTA by up to 3% G-mAP, setting a new baseline for multi-moment retrieval.
DualGround mitigates EOS token bias by introducing an additional phrase-aware path for fine-grained video-language alignment.