Until recently, the field of speaker diarization was dominated by cascaded systems. Because of their limitations, mainly in handling overlapped speech and their cumbersome pipelines, end-to-end models have recently gained great popularity. One of the most successful is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and faster inference. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. We also compare it with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.