End-to-end neural speaker diarization systems can perform speaker diarization while effectively handling overlapping speech. This work explores incorporating speaker information embeddings into end-to-end systems to enhance their speaker-discriminative capabilities while maintaining their overlap-handling strengths. To this end, we propose several methods for incorporating these embeddings alongside the acoustic features. We also analyze the correct handling of silence frames, the window length used to extract speaker embeddings, and the transformer encoder size. The proposed approach is evaluated on the CallHome dataset for the two-speaker diarization task, where it significantly reduces the diarization error rate, achieving a relative improvement of 10.78% over the baseline end-to-end model.
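As a reading aid for the reported figure, the sketch below shows how a relative improvement in diarization error rate (DER) is conventionally computed: the absolute DER reduction divided by the baseline DER. The function name and the DER values are hypothetical, chosen only for illustration; they are not results from this work, which reports only the 10.78% relative figure.

```python
def relative_der_improvement(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction of a system over a baseline, in percent.

    Both arguments are DER values on the same scale (e.g. both in percent).
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical DER values (in percent), for illustration only;
# these are NOT the numbers reported in the paper.
baseline = 10.0
proposed = 9.0
print(f"{relative_der_improvement(baseline, proposed):.2f}% relative improvement")
```

A relative figure of this kind depends on the baseline: the same absolute DER reduction yields a larger relative improvement when the baseline DER is lower.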