Current speaker diarization systems rely on an external voice activity detection (VAD) model and extract speaker embeddings only from the speech segments it detects. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs on par with or better than comparable supervised VAD systems. Consequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embeddings simultaneously, eliminating the need for an external VAD model and its computational overhead. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline that uses ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
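
The core idea, reading frame-level attention logits as VAD scores while the same forward pass pools the speaker embedding, can be illustrated with a minimal sketch. This is not the ECAPA2 architecture or the authors' pipeline: it assumes a simplified single-head attentive statistics pooling layer, and the function name `attentive_pool_with_vad`, the scoring vector `v`, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math


def attentive_pool_with_vad(frames, v, threshold=0.0):
    """Pool frame-level features into one embedding and expose the
    attention logits as per-frame VAD scores (illustrative only).

    frames: list of T feature vectors, each a list of D floats
    v: hypothetical scoring vector of length D
    threshold: hypothetical decision threshold on the raw logits
    """
    # One scalar attention logit per frame: e_t = v . tanh(h_t)
    logits = [
        sum(vi * math.tanh(x) for vi, x in zip(v, frame))
        for frame in frames
    ]

    # Numerically stable softmax over time gives the attention weights
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted mean and standard deviation per feature dimension
    dim = len(frames[0])
    mean = [
        sum(w * f[d] for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(x, 1e-8)) for x in var]

    # Embedding is the concatenation [mean, std]; the VAD decision
    # reuses the same logits, so no second model is evaluated
    embedding = mean + std
    speech_mask = [logit > threshold for logit in logits]
    return embedding, logits, speech_mask


# Toy input: two frames that score high and one that scores low
toy_frames = [[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0]]
toy_v = [1.0, 1.0]
embedding, logits, speech_mask = attentive_pool_with_vad(toy_frames, toy_v)
```

In this toy setup the logits double as speech scores and the low-scoring frame also receives a small attention weight, so it contributes little to the pooled embedding. In a trained extractor the logits would come from learned attention parameters, and the threshold would need to be tuned on held-out data.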