Clustering of speaker embeddings is a crucial component of speaker diarization, yet it has received less attention than other components. Moreover, the robustness of speaker diarization has not been explored for the case where the development and evaluation data come from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between the two domain conditions can be attributed to the role of spectral clustering. In particular, keeping the other modules unchanged, we show that the differences in the optimal tuning parameters and in speaker count estimation originate from the mismatch. This study opens several future directions for speaker diarization research.
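To make concrete where a tuning parameter and the speaker count estimate enter spectral clustering, the following is a minimal sketch and not the pipeline evaluated in this study. It assumes cosine affinities between embeddings, row-wise pruning controlled by a hypothetical parameter `keep_ratio`, an unnormalized graph Laplacian, and an eigengap heuristic capped by a hypothetical `max_speakers`. All function names are illustrative.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Pairwise cosine similarity, with negative values clipped to zero."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.clip(unit @ unit.T, 0.0, None)


def prune_affinity(affinity, keep_ratio):
    """Keep the largest `keep_ratio` fraction of each row, then symmetrize.

    `keep_ratio` is an assumed tuning parameter; it would normally be
    chosen on development data.
    """
    n = affinity.shape[0]
    n_keep = max(1, int(round(keep_ratio * n)))
    pruned = np.zeros_like(affinity)
    for i in range(n):
        top = np.argsort(affinity[i])[-n_keep:]
        pruned[i, top] = affinity[i, top]
    return 0.5 * (pruned + pruned.T)


def estimate_speaker_count(eigenvalues, max_speakers):
    """Eigengap heuristic on ascending Laplacian eigenvalues."""
    gaps = np.diff(eigenvalues[: max_speakers + 1])
    if gaps.size == 0:
        return 1
    return int(np.argmax(gaps)) + 1


def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def spectral_cluster(embeddings, keep_ratio=0.3, max_speakers=8):
    """Return (labels, estimated_speaker_count) for a set of embeddings."""
    affinity = prune_affinity(cosine_affinity(embeddings), keep_ratio)
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # ascending order
    k = estimate_speaker_count(eigenvalues, max_speakers)
    labels = kmeans(eigenvectors[:, :k], k)
    return labels, k


# Toy usage: two synthetic "speakers" with four embeddings each.
toy = np.array(
    [[1.0, 0.0, 0.05 * i] for i in range(4)]
    + [[0.0, 1.0, 0.05 * i] for i in range(4)]
)
toy_labels, toy_count = spectral_cluster(toy, keep_ratio=0.5, max_speakers=4)
```

In this sketch the estimated speaker count depends on the pruned affinity matrix and therefore on `keep_ratio`, so a value tuned on one domain may yield a different count estimate when applied to data from another domain.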