While standard speaker diarization attempts to answer the question "who spoke when", most relevant real-world applications are more interested in determining "who spoke what". Whether one uses the conventional modularized approach or the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate the speaker labels with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same neural architecture. That is, while speech is being recognized, speaker labels are predicted simultaneously for each recognized word. Experimental results demonstrate that WEEND outperforms the turn-based diarization baseline system on all 2-speaker short-form scenarios and can generalize to audio lengths of 5 minutes. Although conversations with three or more speakers are harder, we find that with enough in-domain training data, WEEND has the potential to deliver high-quality diarized text.
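As an illustration only, the "who spoke what" output can be pictured as a sequence of (word, speaker) pairs. The sketch below is not the WEEND model: it assumes a hypothetical list of word-level speaker labels and a hypothetical helper, `words_to_turns`, that merges consecutive words sharing a speaker label into diarized text.

```python
from itertools import groupby
from typing import List, Tuple


def words_to_turns(words: List[Tuple[str, int]]) -> List[Tuple[int, str]]:
    """Merge consecutive words with the same speaker label into turns.

    `words` is a list of (word, speaker_id) pairs in spoken order.
    This input format is an assumption made for illustration.
    """
    return [
        (speaker, " ".join(word for word, _ in group))
        for speaker, group in groupby(words, key=lambda pair: pair[1])
    ]


# Hypothetical word-level output: one speaker label per recognized word.
labeled_words = [
    ("hello", 1), ("there", 1),
    ("hi", 2),
    ("how", 1), ("are", 1), ("you", 1),
]

for speaker, text in words_to_turns(labeled_words):
    print(f"<spk:{speaker}> {text}")
```

Because every word carries its own label, no separate orchestration step is needed to align ASR output with diarization segments; grouping adjacent labels is enough to produce readable turns.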