End-to-end ASR models trained on large amounts of data tend to be implicitly biased toward the language semantics of the training data. Internal language model estimation (ILME) has been proposed to mitigate this bias for autoregressive models such as attention-based encoder-decoder and RNN-T models. Typically, ILME is performed by modularizing the acoustic and language components of the model architecture and removing the acoustic input to obtain a text-only posterior, which is then used in log-linear interpolation. For CTC-based ASR, however, decoupling the model into such acoustic and language components is less straightforward, because CTC log-posteriors are computed in a non-autoregressive manner. In this work, we propose a novel ILME technique for CTC-based ASR models. Our method iteratively masks the audio timesteps and estimates a pseudo log-likelihood of the internal LM by accumulating log-posteriors for only the masked timesteps. Extensive evaluation across multiple out-of-domain datasets shows that the proposed approach improves WER by up to 9.8% and OOV F1-score by up to 24.6% relative to Shallow Fusion when only text data from the target domain is available. In the zero-shot domain adaptation setting, with no access to any target domain data, we demonstrate that removing the source domain bias with ILME still outperforms Shallow Fusion, improving WER by up to 9.3% relative.
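The masking procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how timesteps are masked, how large each masked span is, or how the hypothesis is aligned to frames. The sketch assumes zero-valued masking, fixed-size non-overlapping windows, and a given frame-level alignment. The names `masked_pseudo_log_likelihood`, `ilme_score`, `log_posterior_fn`, and the interpolation weights are all hypothetical.

```python
import numpy as np


def masked_pseudo_log_likelihood(features, alignment, log_posterior_fn,
                                 window=2, mask_value=0.0):
    """Accumulate log-posteriors over masked frames only.

    features:          (T, D) array of acoustic frames.
    alignment:         (T,) integer token id per frame (blank included).
    log_posterior_fn:  callable mapping (T, D) features to (T, V) log-probs;
                       stands in for a CTC model (assumption).
    window, mask_value: masking span and fill value (assumptions).
    """
    features = np.asarray(features, dtype=float)
    alignment = np.asarray(alignment)
    num_frames = features.shape[0]
    total = 0.0
    for start in range(0, num_frames, window):
        stop = min(start + window, num_frames)
        masked = features.copy()
        masked[start:stop] = mask_value        # hide the audio in this span
        log_probs = log_posterior_fn(masked)   # re-run the model on masked input
        frames = np.arange(start, stop)
        # Only the masked frames contribute to the pseudo log-likelihood.
        total += log_probs[frames, alignment[frames]].sum()
    return float(total)


def ilme_score(asr_logp, ext_lm_logp, ilm_logp, ext_weight=0.5, ilm_weight=0.3):
    """Log-linear interpolation: add the external LM, subtract the internal LM.

    The weights are placeholders that would normally be tuned on held-out data.
    """
    return asr_logp + ext_weight * ext_lm_logp - ilm_weight * ilm_logp


def toy_framewise_model(features):
    """Toy stand-in for a CTC model: per-frame linear layer plus log-softmax."""
    weights = np.array([[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]])  # (D=2, V=3)
    logits = features @ weights
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


toy_features = np.ones((5, 2))
toy_alignment = np.array([0, 1, 1, 2, 0])
pll = masked_pseudo_log_likelihood(toy_features, toy_alignment, toy_framewise_model)
```

The toy model has no context across frames, so every masked frame yields a uniform distribution and `pll` carries no linguistic information. A real CTC encoder sees the unmasked neighbouring frames, so its predictions at masked positions depend on surrounding context, which is the signal the pseudo log-likelihood is meant to capture.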