End-to-End Neural Diarization with Encoder-Decoder based Attractor (EEND-EDA) is an end-to-end neural model for automatic speaker segmentation and labeling. It handles a flexible number of speakers by estimating the number of attractors. EEND-EDA, however, struggles to accurately capture local speaker dynamics. This work proposes an auxiliary loss that guides the Transformer encoders in the lower layers of the EEND-EDA model, using speaker activity information to enhance the effect of their self-attention modules. Results on the public Mini LibriSpeech dataset demonstrate the effectiveness of the approach, reducing the Diarization Error Rate from 30.95% to 28.17%. We will release the source code on GitHub to support further research and reproducibility.
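To make the idea of an attention-guiding auxiliary loss concrete, the following is a minimal sketch and not the formulation used in this work, which the abstract does not specify. It assumes that a self-attention map should put its weight on frames with the same speaker activity pattern, and it penalizes the squared difference from that target. The function names, the frame-linking rule, the squared-error form, and the `weight` value are all hypothetical.

```python
# Illustrative sketch only: the abstract does not define the loss, so the
# target pattern and the squared-error form below are assumptions.

def activity_target(labels):
    """Build a row-normalised frame-to-frame target from speaker activity.

    labels[t][s] is 1 if speaker s is active at frame t, else 0.
    Frames i and j are linked when their activity patterns are identical
    (an assumed rule; every frame is linked to itself).
    """
    num_frames = len(labels)
    target = []
    for i in range(num_frames):
        row = [1.0 if labels[i] == labels[j] else 0.0 for j in range(num_frames)]
        total = sum(row)  # at least 1.0 because of the i == j entry
        target.append([v / total for v in row])
    return target


def attention_guidance_loss(attention, labels):
    """Mean squared error between attention weights and the activity target."""
    target = activity_target(labels)
    num_frames = len(labels)
    squared_error = sum(
        (attention[i][j] - target[i][j]) ** 2
        for i in range(num_frames)
        for j in range(num_frames)
    )
    return squared_error / (num_frames * num_frames)


def total_loss(diarization_loss, aux_loss, weight=0.1):
    """Add the weighted auxiliary term to the main loss (weight is hypothetical)."""
    return diarization_loss + weight * aux_loss


# Toy example: 4 frames, 2 speakers (speaker 0 talks first, then speaker 1).
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
uniform = [[0.25] * 4 for _ in range(4)]  # attention that ignores speaker activity
aligned = activity_target(labels)         # attention that matches the target exactly

loss_uniform = attention_guidance_loss(uniform, labels)
loss_aligned = attention_guidance_loss(aligned, labels)
```

In this toy setting the uniform attention map incurs a positive penalty while the activity-aligned map incurs none, which is the behavior one would want from a term that steers attention heads toward speaker activity. In a real model the attention weights would come from a chosen head of a lower encoder layer and the term would be computed with a tensor library.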