Until recently, the field of speaker diarization was dominated by cascaded systems. Because of their limitations, mainly in handling overlapped speech and in their cumbersome pipelines, end-to-end models have recently gained great popularity. One of the most successful is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and inference in almost half the time on long recordings. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. We also compare DiaPer with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.