This paper proposes a novel Attention-based Encoder-Decoder network for End-to-End Neural speaker Diarization (AED-EEND). In the AED-EEND system, we use the target-speaker enrollment information employed in target-speaker voice activity detection (TS-VAD) to calculate the attractors, which mitigates the speaker permutation problem and eases model convergence. During training, we propose a teacher-forcing strategy that obtains the enrollment information from the ground-truth labels. For evaluation, we propose three heuristic decoding methods to identify the enrollment area for each speaker. In addition, we enhance the LSTM-based attractor calculation network used in the end-to-end encoder-decoder based attractor calculation (EEND-EDA) system by incorporating an attention-based model. With this attention-based attractor decoder, the proposed AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5 s of enrollment data.
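The teacher-forcing idea described above, obtaining each speaker's enrollment information from the ground-truth labels, can be illustrated with a small sketch. This is not the authors' implementation: the function names `enrollment_frames` and `mean_embedding`, the choice of non-overlapped frames as the enrollment area, the averaging of frame embeddings, and the 0.1 s frame shift (so that 0.5 s corresponds to 5 frames) are all assumptions made for illustration.

```python
from typing import List, Optional

# Assumption: one frame every 0.1 s, so 0.5 s of enrollment is 5 frames.
ASSUMED_FRAME_SHIFT_S = 0.1
ENROLL_FRAMES = round(0.5 / ASSUMED_FRAME_SHIFT_S)


def enrollment_frames(labels: List[List[int]], speaker: int,
                      max_frames: int) -> List[int]:
    """Return up to `max_frames` frame indices where only `speaker` is active.

    `labels[t][s]` is 1 if speaker `s` talks in frame `t` (ground truth).
    Restricting to non-overlapped frames is a hypothetical selection rule.
    """
    picked: List[int] = []
    for t, row in enumerate(labels):
        if row[speaker] == 1 and sum(row) == 1:
            picked.append(t)
            if len(picked) == max_frames:
                break
    return picked


def mean_embedding(embeddings: List[List[float]],
                   frames: List[int]) -> Optional[List[float]]:
    """Average the frame embeddings over `frames`; None if no frame was found."""
    if not frames:
        return None
    dim = len(embeddings[0])
    return [sum(embeddings[t][d] for t in frames) / len(frames)
            for d in range(dim)]


if __name__ == "__main__":
    # Toy data: 6 frames, 2 speakers, 2-dimensional frame embeddings.
    labels = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
    embeddings = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0],
                  [0.0, 2.0], [0.0, 4.0], [0.0, 0.0]]
    for spk in range(2):
        frames = enrollment_frames(labels, spk, ENROLL_FRAMES)
        print(spk, frames, mean_embedding(embeddings, frames))
```

In a full system, the resulting per-speaker vector would serve as the enrollment input to the attractor decoder. At evaluation time no ground-truth labels exist, which is why the abstract proposes heuristic decoding methods for locating the enrollment area instead.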