Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors involve acronyms, named entities, and domain-specific terms for which little or no training data is available. To improve recognition of these words, we propose a self-supervised continual learning approach. Given the audio of a lecture talk and its corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from previous work. We then perform inference on the talk and collect the utterances that contain detected new words into an adaptation dataset. Continual learning is performed on this dataset by adapting low-rank matrix weights added to each weight matrix of the model. The whole procedure is iterated over many talks. We show that with this approach, performance on the new words increases as they occur more frequently (more than 80% recall), while the general performance of the model is preserved.
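The two mechanical steps named in the abstract, collecting utterances that contain new words and adding a low-rank term to a weight matrix, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names `collect_adaptation_set` and `adapted_weight` are hypothetical, new-word detection is approximated by a case-insensitive string match on decoded hypotheses (the paper relies on the memory-enhanced ASR model for this), and the low-rank update is written in the common form W + BA with B starting at zero, which the abstract does not specify.

```python
import re


def collect_adaptation_set(hypotheses, new_words):
    """Keep (utterance_id, text) pairs whose decoded text contains a new word.

    Assumption: a whole-word, case-insensitive match stands in for the
    model-based detection of new words described in the abstract.
    """
    patterns = [
        re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in new_words
    ]
    return [
        (utt_id, text)
        for utt_id, text in hypotheses
        if any(p.search(text) for p in patterns)
    ]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def adapted_weight(w, b, a):
    """Return W + B A for a frozen W (d_out x d_in), B (d_out x r), A (r x d_in).

    Only B and A would be trained during adaptation; W stays fixed.
    """
    delta = matmul(b, a)
    return [
        [w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical decoded utterances from one talk and new words taken from its slides.
hypotheses = [
    ("utt1", "we fine tune with LoRA today"),
    ("utt2", "thank you for coming"),
    ("utt3", "the wav2vec encoder is frozen"),
]
new_words = ["LoRA", "wav2vec"]
adaptation_set = collect_adaptation_set(hypotheses, new_words)

# Rank-1 update of a 2 x 3 weight matrix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
B_init = [[0.0], [0.0]]      # zero B leaves the model unchanged before training
B_trained = [[1.0], [2.0]]   # made-up values standing in for a trained B

unchanged = adapted_weight(W, B_init, A)
updated = adapted_weight(W, B_trained, A)
```

In this sketch the general-purpose weights `W` are never modified, which is one plausible reading of how the approach preserves general performance while the small matrices absorb the new vocabulary.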