This paper proposes a novel Attention-based Encoder-Decoder network for End-to-End Neural speaker Diarization (AED-EEND). In the AED-EEND system, we incorporate the target-speaker enrollment information used in target-speaker voice activity detection (TS-VAD) to calculate the attractors, which mitigates the speaker permutation problem and eases model convergence. For training, we propose a teacher-forcing strategy that obtains the enrollment information from the ground-truth labels. For evaluation, we propose three heuristic decoding methods to identify the enrollment area for each speaker. Additionally, we enhance the LSTM attractor calculation network used in the end-to-end encoder-decoder based attractor calculation (EEND-EDA) system by incorporating an attention-based model. With this attention-based attractor decoder, the proposed AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5 s of enrollment data.
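The data flow described above (enrollment taken from ground-truth labels, attractors computed with attention, per-frame speaker activity scored against the attractors) can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: it uses a single unparameterized scaled dot-product attention step in place of a trained attention-based decoder, and the names `enrollment_from_labels`, `attention_attractors`, `speaker_activity`, and `n_frames` are hypothetical. The abstract does not state the frame rate, so the number of frames corresponding to 0.5 s of enrollment is left as a parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def enrollment_from_labels(emb, labels, n_frames):
    """Teacher-forced enrollment (assumed form): for each speaker, average the
    first n_frames frame embeddings that the ground-truth labels mark active.

    emb:    (T, D) frame embeddings from an encoder
    labels: (T, S) binary ground-truth speaker activity
    """
    n_speakers = labels.shape[1]
    enroll = np.zeros((n_speakers, emb.shape[1]))
    for s in range(n_speakers):
        idx = np.flatnonzero(labels[:, s])[:n_frames]
        if idx.size > 0:  # a speaker with no active frames keeps a zero vector
            enroll[s] = emb[idx].mean(axis=0)
    return enroll

def attention_attractors(enroll, emb):
    """One attractor per speaker via scaled dot-product cross-attention:
    enrollment vectors are queries, frame embeddings are keys and values.
    No learned projections are used here (an assumption made for brevity)."""
    d = emb.shape[1]
    weights = softmax(enroll @ emb.T / np.sqrt(d), axis=-1)  # (S, T)
    return weights @ emb                                     # (S, D)

def speaker_activity(emb, attractors):
    # Sigmoid of frame-attractor dot products: per-frame, per-speaker posteriors.
    return 1.0 / (1.0 + np.exp(-(emb @ attractors.T)))       # (T, S)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((20, 4))  # 20 frames, 4-dim embeddings (toy sizes)
    labels = np.zeros((20, 2), dtype=int)
    labels[0:10, 0] = 1                 # speaker 0 active in frames 0..9
    labels[8:20, 1] = 1                 # speaker 1 active in frames 8..19
    enroll = enrollment_from_labels(emb, labels, n_frames=5)
    attractors = attention_attractors(enroll, emb)
    print(speaker_activity(emb, attractors).shape)
```

Because each attractor is tied to a specific speaker's enrollment vector, the output columns have a fixed speaker order, which is the intuition behind mitigating the permutation problem. At evaluation time the ground-truth labels would be replaced by an estimated enrollment area; the three heuristic decoding methods mentioned above are not reproduced here.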