PhD student, Tongji University
2 papers at NeurIPS 2025
We propose a causal framework for video temporal grounding that mitigates confounding biases and improves robustness to linguistic variations and irrelevant queries.
ALTo is a novel MLLM-based framework that leverages adaptive-length tokens for multimodal mask generation.