Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Deep neural network-based systems have significantly improved speaker diarization performance. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). During training, we introduce a teacher-forcing strategy to address the speaker permutation problem, which leads to faster model convergence. For evaluation, we propose an iterative decoding method that outputs the diarization results for each speaker sequentially. Additionally, we propose an Enhancer module that enhances the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the Transformer encoder with a Conformer architecture, which better models local information. Furthermore, we find that commonly used simulation datasets for speaker diarization have a much higher overlap ratio than real data, and that training on simulated data that is more consistent with real data improves consistency. Extensive experiments demonstrate the effectiveness of the proposed methods. Our best system achieves new state-of-the-art diarization error rates (DER) on the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) evaluation benchmarks when no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system is also highly competitive as a speech type detection model.
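To make the overlap-ratio comparison between simulated and real data concrete, the following is a minimal sketch of how such a ratio could be measured from frame-level speaker labels. The abstract does not give a definition, so this assumes the overlap ratio is the fraction of speech frames in which two or more speakers are active; the function name `overlap_ratio` and the toy label matrix are hypothetical and used only for illustration.

```python
def overlap_ratio(labels):
    """Return the fraction of speech frames with two or more active speakers.

    labels: list of frames, each a list of 0/1 activity flags (one per speaker).
    Assumed definition: overlapped frames / frames with at least one speaker.
    """
    speech_frames = 0
    overlap_frames = 0
    for frame in labels:
        active = sum(frame)
        if active >= 1:
            speech_frames += 1
        if active >= 2:
            overlap_frames += 1
    # No speech at all: report zero overlap instead of dividing by zero.
    return overlap_frames / speech_frames if speech_frames else 0.0


# Toy two-speaker recording with five frames (hypothetical data).
labels = [
    [0, 0],  # silence
    [1, 0],  # speaker 1 only
    [1, 1],  # overlap
    [0, 1],  # speaker 2 only
    [1, 1],  # overlap
]
ratio = overlap_ratio(labels)
```

Running the same measurement over a simulated corpus and a real one would give a simple way to check how closely their overlap statistics match, under the assumed definition.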
