Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
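To make the abstract's core idea concrete, the sketch below shows one plausible form of a margin-enhanced contrastive loss over audio-visual embedding similarities, with unsynchronized (including cross-speaker) pairs as negatives. This is an illustrative assumption, not the paper's exact formulation; all names (`audio_emb`, `visual_emb`, `is_sync`, `margin`) are hypothetical.

```python
# Illustrative sketch only -- not UniSync's exact loss.
# Assumes audio and visual features have already been projected into a shared
# embedding space, so representations of very different dimensionality
# (Mel spectrograms vs. HuBERT, RGB frames vs. landmarks/3DMM) become comparable.
import torch
import torch.nn.functional as F

def margin_contrastive_loss(audio_emb: torch.Tensor,
                            visual_emb: torch.Tensor,
                            is_sync: torch.Tensor,
                            margin: float = 0.5) -> torch.Tensor:
    """
    audio_emb, visual_emb: (B, D) embeddings in a shared space.
    is_sync: (B,) tensor with 1.0 for synchronized pairs and 0.0 for
        unsynchronized pairs (temporally shifted or cross-speaker negatives).
    """
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)   # (B,)
    # Pull synchronized pairs toward similarity 1.
    pos_loss = is_sync * (1.0 - sim)
    # Push unsynchronized pairs below a similarity margin (hinge term).
    neg_loss = (1.0 - is_sync) * F.relu(sim - margin)
    return (pos_loss + neg_loss).mean()
```

In this reading, cross-speaker negatives would simply be batch entries where the audio embedding comes from one speaker and the visual embedding from another, with `is_sync = 0`; the hinge term then enforces the discriminative margin the abstract refers to.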