The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These limitations distort evaluations of auditory and visual capabilities. To address them, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, using our new modality confusion metric, we reveal model limitations by analysing the performance degradation that occurs when another input modality is added.
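To make the idea of modality confusion concrete, the following is a minimal sketch under stated assumptions, not the definition used in this work. The abstract does not give a formula, so the sketch assumes one plausible reading: the fraction of test samples that a model classifies correctly from a single modality but misclassifies once the second modality is added. The function name `modality_confusion` and the per-sample boolean correctness inputs are hypothetical.

```python
from typing import Sequence


def modality_confusion(
    correct_unimodal: Sequence[bool],
    correct_multimodal: Sequence[bool],
) -> float:
    """Fraction of samples solved with one modality but missed with both.

    ASSUMPTION: this is an illustrative reading of "performance degradation
    when adding another input modality"; it is not taken from the source.
    """
    if len(correct_unimodal) != len(correct_multimodal):
        raise ValueError("both inputs must cover the same samples")
    if len(correct_unimodal) == 0:
        raise ValueError("at least one sample is required")

    # A sample counts as "confused" when the unimodal prediction was correct
    # and the multimodal prediction was not.
    confused = sum(
        1
        for uni, multi in zip(correct_unimodal, correct_multimodal)
        if uni and not multi
    )
    return confused / len(correct_unimodal)


# Hypothetical per-sample correctness for four test clips.
audio_only = [True, True, False, True]
audio_visual = [True, False, False, False]
print(modality_confusion(audio_only, audio_visual))  # 0.5
```

In this toy example, two of the four clips are classified correctly from audio alone but incorrectly once video is added, so the value is 0.5. A multi-label variant would need a per-sample notion of correctness, which is left open here.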