In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods may bring additional benefits, most of them are quite complex. Motivated by spelling correction in automatic speech recognition (ASR), in this paper we propose an end-to-end error correction framework, termed DiaCorrect, to refine the initial diarization results in a simple but efficient way. By exploiting the acoustic interactions between the input mixture and its corresponding speaker activity, DiaCorrect automatically adapts the initial speaker activity to minimize the diarization errors. Without bells and whistles, experiments on LibriSpeech-based 2-speaker meeting-like data show that the self-attentive end-to-end neural diarization (SA-EEND) baseline with DiaCorrect reduces its diarization error rate (DER) from 12.31% to 4.63%, a relative reduction of 62.4%. Our source code is available online at https://github.com/jyhan03/diacorrect.
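The reported improvement is a relative reduction, not an absolute one. The following minimal sketch reproduces that arithmetic from the two DER values quoted above; the function name `relative_reduction` is a hypothetical helper introduced here for illustration and is not taken from the DiaCorrect code base.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction, in percent, when going from `baseline` to `improved`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# DER values (in percent) as quoted in the abstract.
baseline_der = 12.31   # SA-EEND baseline
corrected_der = 4.63   # SA-EEND with DiaCorrect

absolute_drop = baseline_der - corrected_der          # in percentage points
rel = relative_reduction(baseline_der, corrected_der)  # in percent

print(f"absolute: {absolute_drop:.2f} points, relative: {rel:.1f}%")
```

Under these numbers the absolute drop is 7.68 percentage points, which corresponds to the 62.4% relative reduction stated in the text.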