End-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) performs speaker diarization with a single neural network. EDA handles a flexible number of speakers by using an LSTM-based encoder-decoder that generates a set of speaker-wise attractors in an autoregressive manner. In this paper, we propose to replace EDA with a transformer-based attractor calculation (TA) module. TA is composed of a Combiner block and a Transformer decoder. The main function of the Combiner block is to generate conversational dependent (CD) embeddings by incorporating learned conversational information into a global set of embeddings. These CD embeddings then serve as the input to the Transformer decoder. Results on public datasets show that EEND-TA achieves a 2.68% absolute improvement in diarization error rate (DER) over EEND-EDA, and that EEND-TA inference is 1.28 times faster than that of EEND-EDA.
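To make the data flow of the TA module concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the internals of the Combiner block or the decoder, so several parts here are assumptions made purely for illustration: the conversation summary is taken to be a mean over frame embeddings, the Combiner is reduced to an element-wise addition, the Transformer decoder is replaced by a single unparameterized cross-attention step, and the number of speakers is fixed in advance. All function names (`combiner`, `cross_attention`, `speaker_activities`) and the sizes `T`, `D`, `S` are hypothetical. The last step, a sigmoid over frame-attractor dot products, follows the usual attractor-based EEND formulation.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def combiner(global_emb, frames):
    """ASSUMPTION: mean-pool the frames into one conversation summary and add
    it to every global embedding to obtain conversational dependent (CD)
    embeddings. The real Combiner block is learned and may differ."""
    n_frames = len(frames)
    summary = [sum(col) / n_frames for col in zip(*frames)]
    return [[g + s for g, s in zip(row, summary)] for row in global_emb]


def cross_attention(queries, frames):
    """ASSUMPTION: one single-head scaled dot-product attention step stands in
    for the Transformer decoder. CD embeddings are queries, frames are both
    keys and values, and no projection weights are used."""
    d = len(frames[0])
    scores = matmul(queries, transpose(frames))  # shape (S x T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, frames)  # shape (S x D), one attractor per row


def speaker_activities(frames, attractors):
    """Sigmoid of frame-attractor dot products, giving (T x S) posteriors."""
    logits = matmul(frames, transpose(attractors))
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]


random.seed(0)
T, D, S = 6, 4, 3  # hypothetical sizes: frames, embedding dim, speakers

# Random stand-ins for frame embeddings and the global embedding set.
frames = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
global_emb = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(S)]

cd_emb = combiner(global_emb, frames)          # (S x D) CD embeddings
attractors = cross_attention(cd_emb, frames)   # (S x D) attractors
posteriors = speaker_activities(frames, attractors)  # (T x S) activities

for t, row in enumerate(posteriors):
    print(t, [round(p, 3) for p in row])
```

In this sketch, all `S` attractors are produced in one pass over the CD embeddings, whereas the LSTM-based EDA described above emits attractors one at a time. Thresholding each posterior (for example at 0.5) would give per-frame, per-speaker activity decisions.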