7 papers across 3 sessions
Current multi-modal LLMs struggle with live, step-by-step task guidance. We built Qualcomm Interactive Cooking (a new dataset with mistake videos and timed feedback) and LiveMamba (a streaming model) to enable better real-time interactive guidance.
The first multi-modal (vision, audio, digital context, longitudinal) scripted dataset and benchmark for goal inference in wearable assistant agents.
We introduce MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, revealing the limitations of current methods and proposing a new method to enhance retrieval performance.
This paper introduces CERES, which uses dual-modal causal intervention (backdoor adjustment for language bias and front-door adjustment with vision-depth fusion for visual bias) to achieve robust, SOTA egocentric referring video object segmentation.
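For context, these are the standard Pearl-style forms of the two interventions named above (a generic sketch; the paper's exact estimands, conditioning sets, and mediator choices are not specified in this summary):

```latex
% Backdoor adjustment: block confounding between cause X and outcome Y
% by summing over a set Z that satisfies the backdoor criterion.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

% Front-door adjustment: when the confounder is unobserved, route the
% effect of X on Y through a mediator M that intercepts all directed
% paths from X to Y.
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

Per the summary, CERES applies the backdoor form to debias the language pathway and the front-door form (with vision-depth fusion as the intermediate representation) to debias the visual pathway.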