Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors involve acronyms, named entities, and domain-specific terms for which little or no training data is available. To address the recognition of such words, we propose a self-supervised continual learning approach. Given the audio of a lecture talk with corresponding slides, we bias the model towards decoding new words from the slides using a memory-enhanced ASR model from previous work. We then run inference on the talk and collect utterances containing detected new words into an adaptation dataset. Continual learning is performed on this set by adapting low-rank matrices added to each weight matrix of the model. The whole procedure is iterated over many talks. We show that this approach yields increasing performance on new words as they occur more frequently (more than 80% recall) while preserving the general performance of the model.
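The low-rank adaptation step described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: it assumes a LoRA-style setup in which the base weight matrix W is frozen and only the low-rank factors A and B are trained; all names and the rank/scaling choices are hypothetical.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen base weight plus trainable low-rank update.

    Illustrative sketch: the effective weight is W + (alpha/rank) * B @ A,
    where only A and B would receive gradients during continual learning.
    """

    def __init__(self, in_dim, out_dim, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim))      # frozen base weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable factor
        self.B = np.zeros((out_dim, rank))                   # trainable, zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # Because B is initialized to zero, the low-rank term contributes
        # nothing at first, so the adapted model starts identical to the base.
        return x @ (self.W + self.scale * self.B @ self.A).T

layer = LoRALinear(in_dim=8, out_dim=4)
x = np.ones((1, 8))
base_out = x @ layer.W.T
assert np.allclose(layer.forward(x), base_out)  # zero-init: output unchanged
```

Zero-initializing B is the standard way to guarantee that adaptation begins from the unmodified model, which matters here because the general ASR performance must be preserved while the new-word behavior is learned.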