Poster Session 3 · Thursday, December 4, 2025, 11:00 AM–2:00 PM
#4615
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Abstract
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias, where answers can be inferred from visual data alone, and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment.
In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by:
- ensuring that both modalities are necessary to answer correctly, and
- decoupling evaluation into atomic subcategories.
Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models.
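To illustrate the kind of decoupled scoring such a diagnostic benchmark enables, the sketch below computes accuracy separately for each atomic subcategory rather than a single aggregate score. The field names (`video`, `audio`, `question`, `answer`, `subcategory`) and the `predict` callable are hypothetical placeholders for illustration, not the released DAVE interface.

```python
from collections import defaultdict

def diagnostic_scores(examples, predict):
    """Return accuracy per atomic subcategory instead of one aggregate score.

    Illustrative sketch only: `examples` is assumed to be an iterable of dicts
    with hypothetical keys 'video', 'audio', 'question', 'answer', and
    'subcategory'; `predict` is any model callable that takes
    (video, audio, question) and returns an answer string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = predict(ex["video"], ex["audio"], ex["question"])
        total[ex["subcategory"]] += 1
        correct[ex["subcategory"]] += int(pred == ex["answer"])
    # Per-subcategory accuracies expose which capability a model fails on,
    # rather than collapsing all errors into one number.
    return {cat: correct[cat] / total[cat] for cat in total}
```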