Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on the modality-aligned setting, i.e., the audio and visual modalities are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell which events happen, not the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as modality teachers. We propose a simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), to harvest modality labels for the training events. Empirical studies show that the harvested labels improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we find that modality-independent teachers outperform their modality-fused counterparts, since they are immune to noise from the other, potentially unaligned, modality. Moreover, our best model achieves a new state of the art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization, where it also achieves a new state of the art. Code is available at: https://github.com/Franklin905/VALOR.
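The label-harvesting idea can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that); it only assumes that each modality teacher yields a per-segment, per-class score in [0, 1], and that an event is kept for a modality only if it appears in the weak video-level label and the teacher's score exceeds a threshold. The function names `elaborate_labels` and `modality_level`, the threshold of 0.5, and the example scores are all hypothetical.

```python
import numpy as np


def elaborate_labels(video_labels, teacher_scores, threshold):
    """Turn weak video-level labels into segment-level labels for ONE modality.

    video_labels:   (C,) binary vector, 1 if the event occurs somewhere in the video.
    teacher_scores: (T, C) scores from a single-modality teacher (hypothetical input).
    threshold:      hypothetical cut-off above which the teacher is trusted.
    Returns a (T, C) binary array.
    """
    video_labels = np.asarray(video_labels, dtype=bool)
    teacher_scores = np.asarray(teacher_scores, dtype=float)
    confident = teacher_scores > threshold
    # Events absent from the weak label are masked out regardless of the score.
    return (confident & video_labels[None, :]).astype(np.int64)


def modality_level(segment_labels):
    """An event belongs to the modality if any segment carries it."""
    return np.asarray(segment_labels).max(axis=0)


# Toy example: 2 segments, 3 event classes; the weak label says classes 0 and 1 occur.
video_labels = [1, 1, 0]
visual_scores = [[0.9, 0.1, 0.8], [0.7, 0.2, 0.9]]  # made-up teacher outputs
audio_scores = [[0.2, 0.8, 0.1], [0.1, 0.3, 0.2]]   # made-up teacher outputs

visual_segments = elaborate_labels(video_labels, visual_scores, threshold=0.5)
audio_segments = elaborate_labels(video_labels, audio_scores, threshold=0.5)
visual_events = modality_level(visual_segments)
audio_events = modality_level(audio_segments)
```

In this toy case, class 0 ends up as a visual-only event and class 1 as an audio-only event, while class 2 is dropped for both modalities because the weak label excludes it. Each teacher is applied to its own modality independently, which mirrors the modality-independent setting described in the abstract.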