Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method addressed only two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation performance of TF-GridNet, which largely closes the gap between previous methods and oracle separation.