Code-switching speech recognition (CSSR) transcribes speech that switches between multiple languages or dialects within a single sentence. The main challenge in this task is that different languages often have similar pronunciations, which makes it difficult for models to distinguish between them. In this paper, we approach the CSSR task from the perspective of language-specific acoustic boundary learning. We introduce language-specific weight estimators (LSWE) that model acoustic boundary learning separately for each language. In addition, a non-autoregressive (NAR) decoder and a language change detection (LCD) module are employed to assist training. Evaluated on the SEAME corpus, our method achieves state-of-the-art mixed error rates (MER) of 16.29% and 22.81% on the test_man and test_sge sets, respectively. We also demonstrate the effectiveness of our method on a 9000-hour in-house meeting code-switching dataset, where it achieves a 7.9% relative MER reduction.
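For readers unfamiliar with the metric, the following is a minimal sketch of how a mixed error rate is commonly computed for Mandarin-English code-switching: Mandarin is scored per character and English per word, and the total edit distance is divided by the number of reference tokens. The tokenization rule and the function names (`tokenize_mixed`, `edit_distance`, `mixed_error_rate`) are assumptions made for illustration; the abstract does not specify the scoring pipeline used in the paper.

```python
import re


def tokenize_mixed(text):
    """Split mixed text into tokens: one token per CJK character,
    one token per run of Latin letters (assumed tokenization)."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists
    (substitutions, insertions and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def mixed_error_rate(refs, hyps):
    """Corpus-level MER: total edits divided by total reference tokens."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize_mixed(ref)
        hyp_tokens = tokenize_mixed(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference contains no tokens")
    return errors / total


# Hypothetical example: one English word is misrecognized out of
# five reference tokens (three characters plus two words).
mer = mixed_error_rate(["我想去 shopping mall"], ["我想去 shop mall"])
print(f"MER = {mer:.2%}")
```

Under this definition, a relative MER reduction compares two systems as (baseline MER - new MER) / baseline MER, which is distinct from the absolute difference between the two rates.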