The kNN-CTC model has proven effective for monolingual automatic speech recognition (ASR). However, applying it directly to multilingual scenarios such as code-switching presents challenges. Although it offers potential for performance improvement, a kNN-CTC model that uses a single bilingual datastore can inadvertently introduce noise from the other language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring that language-specific information is injected into the ASR process. We apply this framework to state-of-the-art CTC-based models to build an advanced CS-ASR system. Extensive experiments demonstrate the effectiveness of our gated datastore mechanism in improving zero-shot Chinese-English CS-ASR performance.
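The per-frame datastore selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the gate, the distance metric, or how the retrieved distribution is combined with the CTC output, so everything below is an assumption made for illustration: the gate is taken to be a scalar probability `gate_zh` that the frame is Chinese, retrieval uses squared L2 distance with a softmax over negative distances, and the two distributions are mixed linearly with a hypothetical weight `lam`. The function names, the toy vocabulary, and the toy datastores are likewise hypothetical.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over labels."""
    dists = np.sum((keys - query) ** 2, axis=1)   # squared L2 distance to every key
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]               # indices of the k closest keys
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())       # numerically stable softmax
    weights /= weights.sum()
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[nearest], weights)    # accumulate weight per label
    return probs

def gated_knn_ctc_frame(ctc_probs, query, gate_zh, stores, lam=0.3, k=4):
    """Mix one frame's CTC posterior with kNN evidence from the gated datastore.

    Assumption: gate_zh >= 0.5 selects the Chinese datastore, otherwise English.
    """
    lang = "zh" if gate_zh >= 0.5 else "en"
    keys, values = stores[lang]
    knn_probs = knn_distribution(query, keys, values, len(ctc_probs), k=k)
    return lang, (1.0 - lam) * ctc_probs + lam * knn_probs

# Toy setup (hypothetical): label 0 is blank, 1-2 are Chinese, 3-4 are English.
stores = {
    "zh": (np.array([[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]), np.array([1, 1, 2])),
    "en": (np.array([[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]), np.array([3, 4, 3])),
}
ctc_probs = np.array([0.2, 0.3, 0.1, 0.3, 0.1])   # CTC is torn between labels 1 and 3
query = np.array([1.0, 0.05])                     # frame-level encoder feature

lang, mixed = gated_knn_ctc_frame(ctc_probs, query, gate_zh=0.9, stores=stores)
print(lang, mixed.round(3), int(mixed.argmax()))
```

Because only the selected monolingual datastore is queried, the retrieved distribution places no mass on labels from the other language. In the toy frame above, the CTC posterior is tied between a Chinese and an English label, and the Chinese-only retrieval breaks the tie without any English neighbours contributing.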