Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches first separate the speakers and then recognize each of them with a single-speaker ASR system. End-to-end models process overlapped speech directly in a single, powerful neural network. This work proposes a middle-ground approach that, like the modular approach, leverages explicit speech separation, but also feeds information from the speech mixture directly into the ASR module to mitigate the propagation of errors made by the speech separator. We also explore exchanging cross-speaker context information through a layer that combines information from the individual speakers. Our system is optimized in separate and joint training stages and achieves a 7% relative improvement in word error rate over a purely modular setup on the SMS-WSJ task.
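For reference, a relative improvement in word error rate (WER) is conventionally the reduction in WER divided by the baseline WER. The symbols below are introduced here for illustration only and are not notation from this work; the absolute error rates are not stated in this abstract.

\[
\Delta_{\mathrm{rel}} = \frac{\mathrm{WER}_{\mathrm{modular}} - \mathrm{WER}_{\mathrm{proposed}}}{\mathrm{WER}_{\mathrm{modular}}} \approx 0.07
\]

Under this reading, the proposed system's WER is roughly 0.93 times that of the purely modular baseline.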