Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes MTDVocaLiST, a model trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention distributions and value relations in the Transformer of VocaLiST. Additionally, we harness uncertainty weighting to fully exploit the interaction information across all layers. Our proposed method is effective in two respects. From the distillation-method perspective, the MTD loss outperforms other strong distillation baselines. From the distilled model's performance perspective: 1) MTDVocaLiST outperforms the similar-size SOTA models SyncNet and Perfect Match by 15.65% and 3.35%, respectively; 2) MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining similar performance.
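To make the idea concrete, the following is a minimal sketch of how a layer-wise MTD loss with uncertainty weighting could be implemented in PyTorch. It assumes the teacher's and student's per-layer cross-attention distributions and value tensors are exposed; all tensor names, the value-relation form (softmax-normalized scaled value-value similarity), and the uncertainty-weighting parameterization (learned log-variances in the style of Kendall et al.) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mtd_loss(student_attn, teacher_attn, student_v, teacher_v, log_vars, eps=1e-8):
    """Illustrative multimodal Transformer distillation (MTD) loss sketch.

    student_attn / teacher_attn: lists of per-layer cross-attention
        distributions, shape (batch, heads, T_q, T_k), rows sum to 1.
    student_v / teacher_v: lists of per-layer value tensors,
        shape (batch, heads, T_k, d).
    log_vars: learnable per-layer log-variances for uncertainty weighting.
    """
    total = 0.0
    for i, (s_a, t_a, s_v, t_v) in enumerate(
        zip(student_attn, teacher_attn, student_v, teacher_v)
    ):
        # Mimic the teacher's cross-attention distribution with a KL divergence
        # (epsilon added for numerical stability before taking the log).
        attn_loss = F.kl_div((s_a + eps).log(), t_a, reduction="batchmean")

        # Mimic the value relation: scaled value-value similarity matrices,
        # softmax-normalized, compared with KL divergence as well.
        scale = s_v.size(-1) ** 0.5
        s_rel = F.softmax(s_v @ s_v.transpose(-1, -2) / scale, dim=-1)
        t_rel = F.softmax(t_v @ t_v.transpose(-1, -2) / scale, dim=-1)
        val_loss = F.kl_div((s_rel + eps).log(), t_rel, reduction="batchmean")

        # Uncertainty weighting: each layer's loss is scaled by a learned
        # precision exp(-log_var), with log_var added as a regularizer, so
        # the relative importance of layers is learned during training.
        precision = torch.exp(-log_vars[i])
        total = total + precision * (attn_loss + val_loss) + log_vars[i]
    return total
```

In use, `log_vars` would be a trainable parameter, e.g. `torch.nn.Parameter(torch.zeros(num_layers))`, optimized jointly with the student so that informative layers receive higher weight automatically.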