This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) using a shared speech foundation encoder. We combine the hidden representations from multiple UME layers into a residual weighted-sum encoding (RWSE), which draws on information from different semantic levels and contributes to bottom-up alignment between the tasks. This joint training approach captures the inherent interdependencies among the three tasks and improves overall performance on overlapping speech. Our evaluations show that UME substantially improves over single-task baselines dedicated to SD, SS, and multi-speaker ASR on the LibriMix evaluation sets. Notably, for SD, UME outperforms previous studies, achieving diarization error rates of 1.37% and 2.29% on the Libri2Mix and Libri3Mix evaluation sets, respectively.
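To make the idea of a weighted sum over encoder layers concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify how RWSE is computed, so the softmax-normalized layer weights, the choice to add the last layer's output back as the "residual" term, and the names `softmax` and `weighted_layer_sum` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_logits, residual=True):
    """Combine hidden states from several encoder layers.

    layer_outputs: list of L layers; each layer is a list of T frames;
                   each frame is a list of D floats.
    layer_logits:  list of L raw scores (learnable in a real model).
    residual:      ASSUMPTION, not taken from the paper: if True, the last
                   layer's output is added back to the weighted sum.
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")

    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])

    # Accumulate the weighted contribution of every layer, frame by frame.
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]

    if residual:
        last = layer_outputs[-1]
        mixed = [
            [mixed[t][d] + last[t][d] for d in range(dim)]
            for t in range(num_frames)
        ]
    return mixed


# Two toy "layers", one frame each, two feature dimensions.
layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
encoding = weighted_layer_sum(layers, [0.0, 0.0])
```

With equal logits, both layers receive weight 0.5, so the result is the plain average of the layers plus the assumed residual term. In a trained model the logits would be learned, possibly separately for each task, which is one way that different tasks could emphasize different depths of the shared encoder.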