The success of retrieval-augmented language models in various natural language processing (NLP) tasks has not carried over to automatic speech recognition (ASR), largely because fine-grained audio-text datastores are difficult to construct. This paper presents kNN-CTC, a novel approach that overcomes this difficulty by using Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground-truth alignments. We further introduce a skip-blank strategy, which omits CTC blank frames from the datastore to reduce its size. By incorporating a k-nearest neighbors retrieval mechanism into pre-trained CTC ASR systems and leveraging a fine-grained, pruned datastore, kNN-CTC consistently achieves substantial improvements in performance under various experimental settings. Our code is available at https://github.com/NKU-HLT/KNN-CTC.
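The pipeline described above (frame-level keys paired with CTC pseudo labels, skip-blank pruning, and k-nearest neighbors retrieval) can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names, the blank index of 0, the squared L2 distance, the softmax temperature, and the linear interpolation weight `lam` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

BLANK = 0  # assumption: index of the CTC blank symbol in the vocabulary


def build_datastore(frames, ctc_log_probs, skip_blank=True):
    """Pair each frame vector (key) with its CTC argmax pseudo label (value).

    frames:        (T, D) frame-level representations from a pre-trained CTC model
    ctc_log_probs: (T, V) per-frame CTC log-probabilities
    """
    labels = ctc_log_probs.argmax(axis=-1)  # pseudo labels, no ground-truth alignment
    if skip_blank:
        keep = labels != BLANK  # skip-blank: drop frames whose pseudo label is blank
        frames, labels = frames[keep], labels[keep]
    return frames, labels


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Convert the k nearest datastore entries into a distribution over labels."""
    if len(keys) == 0:
        raise ValueError("datastore is empty")
    dists = ((keys - query) ** 2).sum(axis=1)  # squared L2 distance (assumed metric)
    nearest = np.argsort(dists)[: min(k, len(keys))]
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[nearest], weights)  # sum weights of neighbors sharing a label
    return probs


def interpolate(ctc_probs, knn_probs, lam=0.5):
    """Blend the retrieved distribution with the CTC output (assumed linear mix)."""
    return lam * knn_probs + (1.0 - lam) * ctc_probs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, V = 50, 8, 5  # toy sizes: frames, feature dimension, vocabulary
    frames = rng.normal(size=(T, D))
    log_probs = np.log(rng.dirichlet(np.ones(V), size=T))
    keys, values = build_datastore(frames, log_probs)
    print("datastore entries kept:", len(keys), "of", T)
    knn_probs = knn_distribution(frames[0], keys, values, vocab_size=V)
    print("blended distribution:", interpolate(np.exp(log_probs[0]), knn_probs))
```

With skip-blank enabled, no stored value is ever the blank label, so retrieval can only shift probability mass toward non-blank symbols. A brute-force `argsort` is adequate for a toy datastore; a datastore built from a real corpus would likely need an approximate nearest-neighbor index instead.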