7 papers across 3 sessions
Current multi-modal LLMs struggle with live, step-by-step task guidance. We built Qualcomm Interactive Cooking (a new dataset with mistake videos and timed feedback) and LiveMamba (a streaming model) to enable better real-time interactive guidance.
The first multi-modal (vision, audio, digital context, longitudinal) scripted dataset and benchmark for goal inference in wearable assistant agents.
We introduce MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, revealing the limitations of current methods and proposing a new method to enhance retrieval performance.
This paper introduces CERES, which uses dual-modal causal intervention (backdoor adjustment for language bias and front-door adjustment with vision-depth fusion for visual bias) to achieve robust, SOTA egocentric referring video object segmentation.
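For context, these are the standard Pearl-style forms of the two interventions named above (a generic sketch; the paper's exact estimands, conditioning sets, and mediator choices are not specified in this summary):

```latex
% Backdoor adjustment: block confounding between cause X and outcome Y
% by summing over a set Z that satisfies the backdoor criterion.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

% Front-door adjustment: when the confounder is unobserved, route the
% effect of X on Y through a mediator M that intercepts all directed
% paths from X to Y.
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

Per the summary, CERES applies the backdoor form to debias the language pathway and the front-door form (with vision-depth fusion as the intermediate representation) to debias the visual pathway.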