1 paper across 1 session
This paper introduces DAVE, a diagnostic benchmark that requires both audio and visual inputs and separates evaluation into subcategories to reveal specific failure modes in audio-visual models.