Speaker diarization in real-world videos is challenging because of varying acoustic conditions, diverse scenes, and the presence of off-screen speakers, among other factors. Building on a previous study (AVR-Net), this paper introduces AFL-Net, a novel multi-modal speaker diarization system. AFL-Net incorporates dynamic lip movement as an additional modality to improve identity discrimination. Unlike AVR-Net, which extracts high-level representations from each modality independently, AFL-Net employs a two-step cross-attention mechanism to fuse the modalities more thoroughly, yielding more comprehensive information for diarization. We also apply a masking strategy during training, in which the face and lip modalities are randomly obscured; this increases the influence of the audio modality on the system outputs. Experimental results demonstrate that AFL-Net outperforms state-of-the-art baselines such as AVR-Net and DyViSE.
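The training-time masking strategy described in the abstract can be illustrated with a minimal sketch. The abstract states only that the face and lip modalities are randomly obscured; the rest is an assumption made for illustration: the function name `mask_visual_modalities`, the masking probability `p_mask`, independent masking of the two modalities, and zero-filling as the way to obscure an embedding. It is not the authors' implementation.

```python
import random


def mask_visual_modalities(audio, face, lip, p_mask=0.3, rng=random):
    """Randomly obscure the face and lip embeddings of one training sample.

    Assumptions (not from the paper): each visual modality is masked
    independently with probability p_mask, and masking means replacing
    the embedding with zeros. The audio embedding is never masked.
    """
    def maybe_mask(vec):
        # rng.random() returns a float in [0.0, 1.0)
        if rng.random() < p_mask:
            return [0.0] * len(vec)
        return list(vec)

    return list(audio), maybe_mask(face), maybe_mask(lip)


# Hypothetical toy embeddings for a single sample
audio_emb = [0.2, -0.1, 0.7]
face_emb = [0.5, 0.4, -0.3]
lip_emb = [0.1, 0.9, 0.0]

audio_out, face_out, lip_out = mask_visual_modalities(
    audio_emb, face_emb, lip_emb, p_mask=0.3
)
```

Because the audio embedding always passes through unchanged while the visual embeddings are sometimes zeroed, a model trained on such samples cannot rely on the visual streams alone, which is consistent with the stated goal of strengthening the audio modality's influence on the outputs.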