Retrieval-augmented language models have succeeded in a range of natural language processing (NLP) tasks, but their use in automatic speech recognition (ASR) has been limited by the difficulty of constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes this difficulty by using Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground-truth alignments. We further introduce a skip-blank strategy, which ignores CTC blank frames to reduce datastore size. By incorporating a k-nearest neighbors retrieval mechanism into pre-trained CTC ASR systems and leveraging a fine-grained, pruned datastore, kNN-CTC consistently achieves substantial performance improvements under various experimental settings. Our code is available at https://github.com/NKU-HLT/KNN-CTC.
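The datastore construction and retrieval described above can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function names, the blank index `BLANK_ID = 0`, the squared-L2 distance, the softmax temperature, and the interpolation weight `lam` are all assumptions made for illustration. In particular, the abstract does not say how retrieved neighbors are combined with the CTC output; the linear interpolation used here follows the common kNN-LM-style recipe.

```python
import numpy as np

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def build_datastore(encoder_states, ctc_log_probs, skip_blank=True):
    """Pair each frame's encoder state (key) with its CTC pseudo label (value)."""
    # Frame-level pseudo labels: greedy argmax over the CTC output, no forced alignment.
    pseudo_labels = ctc_log_probs.argmax(axis=-1)
    if skip_blank:
        keep = pseudo_labels != BLANK_ID  # skip-blank: drop frames labeled blank
    else:
        keep = np.ones(pseudo_labels.shape, dtype=bool)
    return encoder_states[keep], pseudo_labels[keep]


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over the vocabulary."""
    dists = np.sum((keys - query) ** 2, axis=1)  # squared L2 distance to every key
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances (shifted by the minimum for numerical stability).
    weights = np.exp(-(dists[nearest] - dists[nearest].min()) / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)  # sum weights of neighbors sharing a label
    return p_knn


def interpolate(p_ctc, p_knn, lam=0.5):
    """Assumed kNN-LM-style mixing of the retrieved and the CTC distributions."""
    return lam * p_knn + (1.0 - lam) * p_ctc


# Toy example: 6 frames, 3-dimensional encoder states, vocabulary of 4 symbols.
rng = np.random.default_rng(0)
states = rng.normal(size=(6, 3))
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],  # blank
    [0.1, 0.1, 0.7, 0.1],  # label 2
    [0.2, 0.1, 0.6, 0.1],  # label 2
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.1, 0.7],  # label 3
    [0.9, 0.05, 0.03, 0.02],  # blank
])
keys, values = build_datastore(states, np.log(probs))
p_knn = knn_distribution(states[1], keys, values, vocab_size=4, k=2)
p_final = interpolate(probs[1], p_knn, lam=0.5)
```

In this toy input three of the six frames are labeled blank, so the skip-blank option halves the datastore. A real system would replace the brute-force distance computation with an approximate nearest-neighbor index.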