Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues are not always available due to various impairments, which undermines the stability of AV-TSE. Despite this challenge, humans can maintain attentional momentum over time, even when the target speaker is not visible. In this paper, we introduce Momentum Multi-modal target Speaker Extraction (MoMuSE), which retains a speaker identity momentum in memory, enabling the model to track the target speaker continuously. Designed for real-time inference, MoMuSE extracts the current speech window under the guidance of both visual cues and a dynamically updated speaker momentum. Experimental results demonstrate that MoMuSE achieves significant improvements, particularly in scenarios where visual cues are severely impaired.
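The abstract does not specify how the speaker identity momentum is maintained or updated; the following is a minimal sketch of one plausible realization, assuming an exponential-moving-average (EMA) update and simple linear stand-ins. All names here (`speaker_encoder`, `extractor`, `EMB_DIM`, `alpha`) are hypothetical placeholders for illustration, not the paper's actual components.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; the paper's actual encoders/extractor differ.
EMB_DIM = 256
speaker_encoder = nn.Linear(EMB_DIM, EMB_DIM)   # hypothetical: speech features -> identity embedding
extractor = nn.Linear(EMB_DIM * 3, EMB_DIM)     # hypothetical: (mixture, visual, momentum) -> target speech

def momentum_update(momentum, new_emb, alpha=0.9):
    # EMA-style update: one plausible form of the "dynamically updated"
    # speaker momentum described in the abstract (assumed, not confirmed).
    return alpha * momentum + (1.0 - alpha) * new_emb

# Dummy streaming input: per-window mixture and visual features.
num_windows = 10
mixture_windows = torch.randn(num_windows, EMB_DIM)
visual_windows = torch.randn(num_windows, EMB_DIM)
visual_windows[4:8] = 0.0  # simulate windows where the visual cue is impaired/missing

momentum = torch.zeros(EMB_DIM)  # speaker identity momentum held in memory

for t in range(num_windows):
    mix_feat = mixture_windows[t]
    vis_feat = visual_windows[t]  # zeros when the visual cue drops out
    cond = torch.cat([mix_feat, vis_feat, momentum])
    target = extractor(cond)      # extract the current window's target speech
    with torch.no_grad():
        # Re-estimate the speaker identity from the extracted speech and
        # fold it into the momentum, so tracking survives visual dropout.
        momentum = momentum_update(momentum, speaker_encoder(target))
```

The key design point this sketch illustrates is that the conditioning signal degrades gracefully: when `vis_feat` vanishes, the momentum embedding accumulated from previously extracted windows still anchors the extractor to the target speaker.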