Poster Session 3 · Thursday, December 4, 2025, 11:00 AM–2:00 PM
#4615
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Abstract
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias, where answers can be inferred from visual data alone, and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment.
In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by:
- ensuring that both modalities are necessary to answer correctly, and
- decoupling evaluation into atomic subcategories.
Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models.
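To illustrate the kind of decoupled scoring such a diagnostic benchmark enables, the sketch below computes accuracy separately for each atomic subcategory rather than a single aggregate score. The field names (`video`, `audio`, `question`, `answer`, `subcategory`) and the `predict` callable are hypothetical placeholders for illustration, not the released DAVE interface.

```python
from collections import defaultdict

def diagnostic_scores(examples, predict):
    """Return accuracy per atomic subcategory instead of one aggregate score.

    Illustrative sketch only: `examples` is assumed to be an iterable of dicts
    with hypothetical keys 'video', 'audio', 'question', 'answer', and
    'subcategory'; `predict` is any model callable that takes
    (video, audio, question) and returns an answer string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = predict(ex["video"], ex["audio"], ex["question"])
        total[ex["subcategory"]] += 1
        correct[ex["subcategory"]] += int(pred == ex["answer"])
    # Per-subcategory accuracies expose which capability a model fails on,
    # rather than collapsing all errors into one number.
    return {cat: correct[cat] / total[cat] for cat in total}
```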