Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model and a training procedure that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.