Until recently, the field of speaker diarization was dominated by cascaded systems. Because of their limitations, mainly in handling overlapped speech and their cumbersome pipelines, end-to-end models have recently gained great popularity. One of the most successful is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: it obtains better performance on the widely studied Callhome dataset, estimates the number of speakers in a conversation more accurately, and runs inference in almost half the time on long recordings. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. We also compare against other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.