Multi-talker overlapped speech recognition remains a significant challenge because it requires solving both speech recognition and speaker diarization. In this paper, we first introduce speaker labels into an autoregressive transformer-based speech recognition model so that it supports multi-speaker overlapped speech recognition. Then, to improve speaker diarization, we propose a novel speaker mask branch that detects the speech segments of individual speakers. The proposed model performs speech recognition and speaker diarization simultaneously within a single model. Experimental results on a LibriSpeech-based overlapped dataset demonstrate the effectiveness of the proposed method on both tasks, and in particular show improved speaker diarization accuracy in relatively complex multi-talker scenarios.
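To make the idea of detecting per-speaker speech segments concrete, the following is a minimal sketch of how a frame-level speaker mask could be turned into time segments. It is not the method of the paper: the abstract does not specify the output format of the speaker mask branch, so the per-frame activity scores, the `threshold` value, the `frame_shift_s` frame step, and the function name `mask_to_segments` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def mask_to_segments(
    mask: Sequence[float],
    threshold: float = 0.5,      # assumed decision threshold, not from the paper
    frame_shift_s: float = 0.01,  # assumed frame step in seconds, not from the paper
) -> List[Tuple[float, float]]:
    """Convert per-frame speaker activity scores into (start, end) segments in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(mask):
        active = score >= threshold
        if active and start is None:
            start = i  # a new segment opens at this frame
        elif not active and start is not None:
            # the segment closes at the first inactive frame
            segments.append((start * frame_shift_s, i * frame_shift_s))
            start = None
    if start is not None:
        # the mask ended while the speaker was still active
        segments.append((start * frame_shift_s, len(mask) * frame_shift_s))
    return segments


def diarize(
    masks: Dict[str, Sequence[float]],
    threshold: float = 0.5,
    frame_shift_s: float = 0.01,
) -> Dict[str, List[Tuple[float, float]]]:
    """Apply mask_to_segments to one hypothetical mask per speaker label."""
    return {
        speaker: mask_to_segments(mask, threshold, frame_shift_s)
        for speaker, mask in masks.items()
    }


if __name__ == "__main__":
    # Toy masks for two overlapping speakers (made-up scores, one per frame).
    toy_masks = {
        "spk1": [0.9, 0.8, 0.7, 0.2, 0.1, 0.1],
        "spk2": [0.1, 0.2, 0.8, 0.9, 0.9, 0.3],
    }
    for speaker, segs in diarize(toy_masks, frame_shift_s=1.0).items():
        print(speaker, segs)
```

Because each speaker has an independent mask, two speakers can be active in the same frame, which is the situation that arises in overlapped speech.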