Speaker diarization is the task of partitioning an audio recording by speaker identity. End-to-end neural diarization with encoder-decoder based attractor calculation (EEND-EDA) addresses this task by directly outputting diarization results for a flexible number of speakers. Currently, the EDA module responsible for generating speaker-wise attractors is conditioned on zero vectors, which provide no relevant information to the network. In this work, we extend EEND-EDA by replacing the zero vectors fed to the decoder with learned conversational summary representations. The updated EDA module sequentially generates speaker-wise attractors based on utterance-level information. We propose three methods to initialize the summary vector and investigate the effect of varying input recording lengths. On a range of publicly available test sets, our model achieves an absolute diarization error rate (DER) improvement of 1.90% over the baseline.
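To make the change to the decoder input concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain tanh RNN in place of the LSTM encoder and decoder, untrained random weights, and mean pooling of frame embeddings as the summary vector; the abstract does not specify the three initialization methods, so mean pooling is purely illustrative. All names (`decode_attractors`, `mean_pool`, `random_params`, `params`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_step(W_x, W_h, x, h):
    """One tanh RNN step, a simplified stand-in for an LSTM cell."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]


def mean_pool(frames):
    """Assumed summary: the average of all frame embeddings (T x D -> D)."""
    T, D = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / T for d in range(D)]


def decode_attractors(frames, params, max_speakers=4, use_summary=True):
    """Return (attractors, existence_probs) for max_speakers decoder steps."""
    D = len(frames[0])
    # Encoder: run over the frame embeddings and keep the final hidden state.
    h = [0.0] * D
    for f in frames:
        h = rnn_step(params["enc_x"], params["enc_h"], f, h)
    # Decoder input: zero vector (baseline) or conversational summary (extension).
    x = mean_pool(frames) if use_summary else [0.0] * D
    attractors, probs = [], []
    for _ in range(max_speakers):
        h = rnn_step(params["dec_x"], params["dec_h"], x, h)
        attractors.append(h)
        # Existence probability: linear projection of the attractor, then sigmoid.
        probs.append(sigmoid(sum(w * v for w, v in zip(params["exist"], h))))
    return attractors, probs


def random_params(D, rng):
    """Untrained random weights, for illustration only."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(D)]
    return {"enc_x": mat(), "enc_h": mat(), "dec_x": mat(), "dec_h": mat(),
            "exist": [rng.uniform(-0.5, 0.5) for _ in range(D)]}


rng = random.Random(0)
D, T = 8, 20  # embedding size and number of frames (arbitrary toy values)
frames = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
params = random_params(D, rng)

baseline, _ = decode_attractors(frames, params, use_summary=False)
extended, probs = decode_attractors(frames, params, use_summary=True)

# Frame-wise speaker activity: sigmoid of frame-attractor dot products.
activity = [[sigmoid(sum(a * b for a, b in zip(f, att))) for att in extended]
            for f in frames]
```

In a trained system, the existence probabilities would be thresholded to decide how many attractors to keep, which is how a flexible number of speakers is handled. The only difference between the two calls above is the decoder input `x`: with `use_summary=False` every decoder step receives a zero vector, while with `use_summary=True` every step receives the same recording-level summary.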