We introduce TriSense, a triple-modality MLLM that achieves comprehensive understanding of video moments by adaptively integrating visual, audio, and speech information. To support its training, we construct a new dataset, TriSense-2M.