Poster Session 3 · Thursday, December 4, 2025 11:00 AM → 2:00 PM
#1604
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Abstract
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically process entire videos offline and rely on very large models, which limits their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams.
The On-AVEP task necessitates models with two key capabilities:
- Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and
- Real-time efficiency, to balance high performance with computational constraints.
To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, which features
- predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and
- modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization.
Extensive experiments on the UnAV-100 and LLP datasets show that PreFM outperforms state-of-the-art methods by a large margin while using significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding.
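
To make the online constraint concrete, here is a minimal Python sketch of sequential parsing over an incoming stream. It is not the PreFM implementation: the names `parse_online`, `toy_predict`, and `context_len`, the segment layout, and the threshold rule are all hypothetical, chosen only to illustrate that each prediction may use past and current segments but never future ones.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical segment layout: (audio_features, visual_features).
Segment = Tuple[Sequence[float], Sequence[float]]


def parse_online(
    stream: Iterable[Segment],
    predict: Callable[[List[Segment]], List[str]],
    context_len: int = 4,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (step, events) as segments arrive, using only past context."""
    # Bounded buffer: the oldest segment is dropped once context_len is reached,
    # so memory and per-step cost stay constant regardless of stream length.
    buffer: deque = deque(maxlen=context_len)
    for step, segment in enumerate(stream):
        buffer.append(segment)
        # The predictor sees only segments up to the current step.
        yield step, predict(list(buffer))


def toy_predict(context: List[Segment]) -> List[str]:
    """Stand-in for a real model: flags an event when the latest audio is loud."""
    audio, _visual = context[-1]
    return ["sound"] if max(audio) > 0.5 else []


stream = [([0.1], [0.0]), ([0.9], [0.2]), ([0.7], [0.3])]
results = list(parse_online(stream, toy_predict, context_len=2))
```

An offline parser would instead receive the full list of segments at once. A real predictor would replace `toy_predict`, for example one that also estimates future cues from the buffered context.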