We introduce TriSense, a triple-modality MLLM that achieves comprehensive understanding of video moments by adaptively integrating visual, audio, and speech information. To support its training, we construct a new dataset, TriSense-2M.