In this paper, we make explicit the connection between image segmentation methods and end-to-end diarization methods. Building on this connection, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel by a stack of transformer decoders, in which irrelevant frames are explicitly masked out of the cross-attention using the predictions of previous layers. EEND-M2F is lightweight, efficient, and truly end-to-end: it requires no additional diarization, speaker verification, or segmentation models, and it runs no clustering algorithm. Our model achieves state-of-the-art performance on several public datasets, such as AMI, AliMeeting, and RAMC. Most notably, our diarization error rate (DER) of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
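To illustrate the masked cross-attention idea described above, the following is a minimal single-head NumPy sketch, not the EEND-M2F implementation. It assumes one query vector per candidate speaker, one embedding per audio frame, and per-speaker activity logits from the previous decoder layer. The function name `masked_cross_attention`, the 0.5 threshold, and the fallback for fully masked queries are illustrative assumptions in the spirit of Mask2Former; learned projections, multiple heads, and residual connections are omitted.

```python
import numpy as np


def masked_cross_attention(queries, frames, prev_mask_logits, threshold=0.5):
    """Single-head cross-attention restricted by a previous layer's predictions.

    queries:          (S, D) one vector per candidate speaker (hypothetical shape)
    frames:           (T, D) one embedding per audio frame
    prev_mask_logits: (S, T) activity logits predicted by the previous layer
    Returns the updated queries (S, D) and the attention weights (S, T).
    """
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)  # (S, T) scaled dot products

    # Frames the previous layer considers active for each speaker are kept.
    prev_probs = 1.0 / (1.0 + np.exp(-prev_mask_logits))
    keep = prev_probs >= threshold

    # Assumption: a query whose mask is empty attends to all frames instead,
    # so the softmax below never normalises over an empty set.
    empty = ~keep.any(axis=-1, keepdims=True)
    keep = keep | empty

    # Masked frames get -inf so they receive exactly zero attention weight.
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ frames, weights
```

In this sketch each speaker query is refined only from the frames it was previously predicted to be active in, which is the mechanism the abstract refers to as masking irrelevant frames from the cross-attention.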