PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu, Yan Fang, Yao Zhao, Yunchao Wei

Beijing Jiaotong University · Visual Intelligence + X International Joint Laboratory

Online Audio-Visual Learning · Audio-Visual Video Parsing · Audio-Visual Event Localization

NeurIPS ⋅ Project Page ⋅ Poster ⋅ OpenReview

Abstract

Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically process entire videos offline with very large models, which limits their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams.
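To make the online setting concrete, the following is a minimal sketch of a streaming parsing loop, not the PreFM implementation. It assumes per-segment audio and visual features arrive one at a time, and that predictions for the current segment may use only a bounded window of past context. The names `parse_online`, `toy_score_fn`, and `context_len`, as well as the threshold-based scoring rule, are hypothetical stand-ins for a real model.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical segment type: (audio_feature, visual_feature) for one time step.
Segment = Tuple[Sequence[float], Sequence[float]]


def parse_online(
    stream: Iterable[Segment],
    score_fn: Callable[[List[Segment]], List[str]],
    context_len: int = 4,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (time_step, event_labels) as each segment arrives.

    Only the current segment and up to context_len - 1 past segments are
    visible at each step; future segments are never read ahead.
    """
    buffer: Deque[Segment] = deque(maxlen=context_len)  # drops the oldest segment automatically
    for t, segment in enumerate(stream):
        buffer.append(segment)
        yield t, score_fn(list(buffer))


def toy_score_fn(context: List[Segment]) -> List[str]:
    """Placeholder for a real model: thresholds the mean of the first feature."""
    audio = sum(a[0] for a, _ in context) / len(context)
    visual = sum(v[0] for _, v in context) / len(context)
    labels = []
    if audio >= 0.5:
        labels.append("audio")
    if visual >= 0.5:
        labels.append("visual")
    if audio >= 0.5 and visual >= 0.5:
        labels.append("audio-visual")
    return labels


# Toy stream: silence, then a sound, then a sound with a matching visual event.
demo_stream: List[Segment] = [
    ([0.0], [0.0]),
    ([1.0], [0.0]),
    ([1.0], [1.0]),
    ([1.0], [1.0]),
]

for step, events in parse_online(demo_stream, toy_score_fn, context_len=2):
    print(step, events)
```

Because `parse_online` is a generator, each prediction is emitted before the next segment is consumed, which is the property that separates this setting from offline parsing of a complete video.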

The On-AVEP task necessitates models with two key capabilities:

Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and
Real-time efficiency, to balance high performance with computational constraints.

To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, which features

predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and
modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization.

Extensive experiments on the UnAV-100 and LLP datasets show that PreFM outperforms state-of-the-art methods by a large margin while using significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding.