This work addresses Active Speaker Detection (ASD), the task of determining whether a person in a series of video frames is speaking. Previous work has approached the task mainly by exploring network architectures, while learning effective representations has received less attention. We propose TalkNCE, a novel talk-aware contrastive loss. The loss is applied only to the parts of each segment in which a person on screen is actually speaking, which encourages the model to learn effective representations from the natural correspondence between speech and facial movements. The loss can be optimized jointly with the existing objectives for training ASD models, without additional supervision or training data. Experiments show that it integrates easily into existing ASD frameworks and improves their performance. Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
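As an illustration of the idea of a contrastive loss restricted to speaking frames, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style formulation in which, for each frame where the person is speaking, the visual embedding of that frame is paired with the audio embedding of the same frame as the positive, and the audio embeddings of the other speaking frames serve as negatives. The function name `talk_aware_nce`, the use of cosine similarity, and the `temperature` value are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)


def talk_aware_nce(visual, audio, speaking, temperature=0.1):
    """InfoNCE-style loss computed only over frames labelled as speaking.

    visual, audio: per-frame embedding vectors, both of length T.
    speaking: per-frame 0/1 labels (1 = the person is speaking).
    The formulation and the temperature are assumptions for illustration.
    """
    # Keep only the frames in which the person is actually speaking.
    active = [t for t, s in enumerate(speaking) if s]
    if len(active) < 2:
        # With fewer than two speaking frames there are no negatives.
        return 0.0

    total = 0.0
    for i in active:
        # Similarity of this visual frame to every speaking audio frame.
        logits = [cosine(visual[i], audio[j]) / temperature for j in active]
        # Numerically stable log-sum-exp over positive and negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        positive = cosine(visual[i], audio[i]) / temperature
        total += log_denominator - positive
    return total / len(active)


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    speaking = [1, 1, 0]  # the third frame is ignored by the loss
    print(talk_aware_nce(visual, audio, speaking))
```

In a training loop, a term like this would typically be added to the existing ASD objective with a weighting factor, for example `total_loss = asd_loss + weight * talk_aware_nce(...)`, where `asd_loss` and `weight` are hypothetical names; the abstract states only that the loss is optimized jointly with the existing objectives.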