Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We consider ConvTasNet and DPRNN as alternatives for the separation network, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied to each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves on the state of the art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show that the system is particularly strong on overlapped speech sections.
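The stitching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the per-source voice activity decisions and speaker embeddings for each short segment are already available, that segments do not overlap in time, and that a greedy cosine-similarity rule with a hypothetical threshold of 0.5 stands in for the incremental clustering. The names `IncrementalSpeakerClusterer` and `stitch` are illustrative only.

```python
import numpy as np


class IncrementalSpeakerClusterer:
    """Greedy online clustering of speaker embeddings (illustrative stand-in)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # hypothetical cosine-similarity cutoff
        self.sums = []              # running sum of embeddings per global speaker

    def assign(self, embeddings):
        """Map each local source of one segment to a global speaker index."""
        labels, used = [], set()
        for emb in embeddings:
            emb = np.asarray(emb, dtype=float)
            emb = emb / np.linalg.norm(emb)
            best, best_sim = None, self.threshold
            for k, s in enumerate(self.sums):
                if k in used:  # two sources in one segment are distinct speakers
                    continue
                sim = float(emb @ (s / np.linalg.norm(s)))
                if sim > best_sim:
                    best, best_sim = k, sim
            if best is None:   # nothing similar enough: open a new speaker
                self.sums.append(emb.copy())
                best = len(self.sums) - 1
            else:              # update the matched speaker's running sum
                self.sums[best] += emb
            used.add(best)
            labels.append(best)
        return labels


def stitch(segments, clusterer):
    """Concatenate local activity into a (n_speakers, n_frames) 0/1 matrix.

    segments: list of (activity, embeddings) pairs, where activity is an
    integer array of shape (n_sources, n_frames) and embeddings holds one
    vector per source. Silent sources are skipped.
    """
    placed = []
    for activity, embeddings in segments:
        active = [i for i in range(activity.shape[0]) if activity[i].any()]
        labels = clusterer.assign([embeddings[i] for i in active])
        placed.append((list(zip(active, labels)), activity))

    total = sum(a.shape[1] for _, a in placed)
    out = np.zeros((len(clusterer.sums), total), dtype=int)
    t = 0
    for pairs, activity in placed:
        n = activity.shape[1]
        for src, spk in pairs:
            out[spk, t:t + n] = np.maximum(out[spk, t:t + n], activity[src])
        t += n
    return out
```

In this sketch the separation network may emit the same speaker on a different output channel from one segment to the next; the embedding match is what restores a consistent global label. A real system would also need to handle overlapping segments and the choice of threshold, which are left out here.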