Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring that both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models.
Dataset: https://huggingface.co/datasets/gorjanradevski/dave
Code: https://github.com/gorjanradevski/dave