The integration of visual cues has revitalized target speech extraction, pushing the task to the forefront of the field. Nevertheless, this multi-modal learning paradigm often suffers from modality imbalance: in audio-visual target speech extraction, the audio modality tends to dominate, potentially overshadowing the visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality while visual information acts as the conditional modality; in the speech production stage, these roles are reversed. This exchange of modality roles is intended to alleviate the modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with that conveyed by the lip movements during the speech production stage. Through extensive experiments on multiple benchmark datasets for audio-visual target speech extraction, we demonstrate the superior performance of the proposed method.
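To make the contrastive semantic matching loss concrete, the sketch below shows one common way such an objective can be instantiated; the exact formulation used in AVSepChain is not given here, so this assumes an InfoNCE-style loss between per-utterance semantic embeddings of the generated speech and of the lip movements, matched within a mini-batch. The function name, temperature value, and embedding shapes are illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact loss): InfoNCE-style contrastive
# semantic matching between generated-speech embeddings and lip-movement
# embeddings of the same utterances in a mini-batch.
import torch
import torch.nn.functional as F


def contrastive_semantic_matching_loss(speech_emb: torch.Tensor,
                                       lip_emb: torch.Tensor,
                                       temperature: float = 0.07) -> torch.Tensor:
    """speech_emb, lip_emb: (batch, dim) semantic embeddings of matched utterances."""
    # L2-normalize so the dot product becomes a cosine similarity.
    speech_emb = F.normalize(speech_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)

    # Pairwise similarities between every speech / lip pair in the batch.
    logits = speech_emb @ lip_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE: each speech embedding should match its own lip
    # embedding (and vice versa) against all other utterances in the batch.
    loss_s2v = F.cross_entropy(logits, targets)
    loss_v2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2v + loss_v2s)
```

In this view, the loss pulls the semantic representation of the produced speech toward the representation of the corresponding lip movements while pushing it away from non-matching utterances, which is one way to enforce the semantic agreement described above.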