We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and the sequence-to-sequence (Seq2Seq) architecture, improving both efficiency and performance. We further reduce the memory footprint of decoding by incorporating input feature fusion, and we employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, a relative improvement of 49% over the official baseline system, and was the key technique in our achieving the best performance in the main track of the CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) into the MA-MSE module to retrieve cleaner and more discriminative multi-speaker embeddings, enabling the current model to outperform the system we used in the CHiME-7 DASR Challenge. Our code will be available at https://github.com/liyunlongaaa/NSD-MS2S.
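To make the reported figures concrete, the following minimal sketch shows how a macro DER and a relative improvement could be computed. It assumes that "macro" means an unweighted mean of per-scenario DER values and that relative improvement is the error reduction divided by the baseline error; the function names and scenario keys are hypothetical and are not taken from the NSD-MS2S code. Under these assumptions, the rounded figures quoted above (15.9% and 49%) would imply a baseline macro DER of roughly 31%; that value is derived here by arithmetic, not quoted from the challenge results.

```python
from statistics import mean


def macro_der(per_scenario_der):
    """Unweighted mean of per-scenario DER values (in percent).

    Assumption: "macro" averaging gives each scenario equal weight,
    regardless of how much audio it contains.
    """
    return mean(per_scenario_der.values())


def relative_improvement(baseline, system):
    """Fraction of the baseline error rate removed by the system."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


def implied_baseline(system, rel_improvement):
    """Baseline error rate implied by a system score and a relative gain."""
    if not 0 <= rel_improvement < 1:
        raise ValueError("relative improvement must be in [0, 1)")
    return system / (1.0 - rel_improvement)


if __name__ == "__main__":
    # Hypothetical per-scenario DERs, for illustration only.
    example = {"scenario_a": 10.0, "scenario_b": 20.0, "scenario_c": 30.0}
    print(f"macro DER: {macro_der(example):.1f}%")

    # Baseline implied by the rounded figures in the abstract.
    baseline = implied_baseline(15.9, 0.49)
    print(f"implied baseline macro DER: {baseline:.1f}%")
    print(f"relative improvement: {relative_improvement(baseline, 15.9):.0%}")
```

Because both quoted numbers are rounded, the implied baseline is only approximate.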