Connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, because CTC decoding assumes that output tokens are independent of one another, an external language model (LM) is required, which destroys its fast parallel decoding property. Several studies have proposed transferring linguistic knowledge from a pretrained LM (PLM) to CTC-based ASR. Since the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required in order to transfer the context-dependent linguistic knowledge from the PLM to the acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT and is then used to transform a latent acoustic representation so that it matches the context-dependent linguistic features encoded by the PLM. Based on this alignment, the latent acoustic feature is forced to encode context-dependent linguistic information. We integrate this latent acoustic feature to build a conformer encoder-based CTC ASR system. On the AISHELL-1 corpus, our system achieved character error rates (CER) of 3.96% and 4.27% on the dev and test sets, respectively, which correspond to relative improvements of 28.39% and 29.42% over the baseline conformer CTC ASR system without cross-modal knowledge transfer.
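The alignment step described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes entropy-regularized OT solved with Sinkhorn iterations, a cosine-distance cost, uniform marginals, and acoustic and linguistic features already projected to a shared dimension. All function names, the regularization value `eps`, and the random inputs are hypothetical.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (T, d) and y (N, d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T


def sinkhorn_coupling(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT coupling with uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each acoustic frame
    b = np.full(m, 1.0 / m)  # mass on each text token
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    # Coupling matrix: entry (i, j) is the mass moved from frame i to token j.
    return u[:, None] * kernel * v[None, :]


def align_acoustic_to_text(acoustic, linguistic, eps=0.1):
    """Map T acoustic frames onto N token positions via the coupling."""
    gamma = sinkhorn_coupling(cosine_cost(acoustic, linguistic), eps)
    # Barycentric projection: normalize each column so that every token
    # position receives a weighted average of the acoustic frames.
    weights = gamma / gamma.sum(axis=0, keepdims=True)
    aligned = weights.T @ acoustic  # shape (N, d)
    return gamma, aligned


# Hypothetical inputs: 12 acoustic frames and 5 PLM token embeddings, d = 8.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(12, 8))
linguistic = rng.normal(size=(5, 8))
gamma, aligned = align_acoustic_to_text(acoustic, linguistic)
```

In a training setup, the projected features `aligned` could be compared with the PLM features through a distance-based loss added to the CTC objective; the choice of that loss is a further assumption and is not specified here.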