Audiovisual representation learning typically relies on the correspondence between sight and sound. However, multiple audio tracks can often correspond to the same visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies to augment cross-modal contrastive learning. Our approach learns representations in which alternate audio tracks that differ only in speech content are mapped close to the same video. Our results show that dub-augmented training improves performance on a range of auditory and audiovisual tasks, without significantly affecting linguistic task performance overall. We additionally compare this approach to a strong baseline in which speech is removed before pretraining, and find that dub-augmented training is more effective, including on paralinguistic and audiovisual tasks where speech removal leads to worse performance. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training more robust audiovisual models.
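As an illustration only, the dub-augmented contrastive objective described in the abstract could be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so a video-to-audio InfoNCE loss averaged over audio tracks is assumed here, and the names `info_nce`, `dub_augmented_loss`, and `temperature`, as well as the toy embeddings, are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(video, audio, temperature=0.1):
    """Video-to-audio InfoNCE (assumed loss, not confirmed by the source).

    Clip i's positive is audio[i]; every other audio row is a negative.
    """
    video = [normalize(v) for v in video]
    audio = [normalize(a) for a in audio]
    total = 0.0
    for i, v in enumerate(video):
        logits = [dot(v, a) / temperature for a in audio]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(video)


def dub_augmented_loss(video, audio_tracks, temperature=0.1):
    """Average the loss over all audio tracks (original plus dubs).

    Assumption: each track in audio_tracks is aligned clip-by-clip with video,
    so every language version of a clip is pulled toward the same video.
    """
    losses = [info_nce(video, track, temperature) for track in audio_tracks]
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two clips, one original track and one dub.
video = [[1.0, 0.0], [0.0, 1.0]]
original = [[0.9, 0.1], [0.1, 0.9]]
dub = [[0.8, 0.2], [0.2, 0.8]]

loss_aligned = dub_augmented_loss(video, [original, dub])
# Reversing the tracks pairs each clip with the wrong audio.
loss_shuffled = dub_augmented_loss(video, [original[::-1], dub[::-1]])
```

In this toy setup the correctly paired tracks yield a lower loss than the mispaired ones, which is the behaviour a contrastive objective of this kind is meant to reward. Averaging over tracks is one simple design choice; the source does not say how the dubbed tracks are weighted or sampled.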