Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework for a connectionist temporal classification (CTC) based ASR system, in which hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process, of which transformer attention is a special case. CMKT learning is intended to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with character error rates (CERs) of 3.64% and 3.94% on the development and test sets, which correspond to relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system, respectively.
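The relation between Sinkhorn attention and transformer attention can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names `sinkhorn_attention` and `softmax_attention`, the `n_iters` parameter, and the choice to start with a row normalization are assumptions made for illustration. Under those assumptions, a single row normalization of the exponentiated scores is exactly the softmax used in transformer attention, while further alternating column and row normalizations push the attention matrix toward a doubly stochastic one.

```python
import numpy as np


def softmax_attention(scores):
    """Row-wise softmax over a (queries x keys) score matrix."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def sinkhorn_attention(scores, n_iters=1):
    """Hypothetical Sinkhorn-style normalization of attention scores.

    n_iters=1 performs one row normalization only (softmax attention).
    Each further iteration adds a column normalization followed by a
    row normalization, so rows always sum to 1 on return.
    """
    # A global shift leaves the result unchanged and avoids overflow.
    p = np.exp(scores - scores.max())
    p = p / p.sum(axis=1, keepdims=True)      # row normalization
    for _ in range(n_iters - 1):
        p = p / p.sum(axis=0, keepdims=True)  # column normalization
        p = p / p.sum(axis=1, keepdims=True)  # row normalization
    return p


# Toy example: 4 acoustic frames attending to 4 linguistic tokens
# (square, so a doubly stochastic limit exists).
rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((4, d))   # stand-in for acoustic features
k = rng.standard_normal((4, d))   # stand-in for linguistic features
scores = q @ k.T / np.sqrt(d)

one_step = sinkhorn_attention(scores, n_iters=1)
many_steps = sinkhorn_attention(scores, n_iters=200)
```

Here `one_step` coincides with `softmax_attention(scores)`, and `many_steps` has rows and columns that each sum to approximately 1. The square toy shape is chosen only so that both marginals can be satisfied at once; aligning sequences of different lengths would need non-uniform target marginals, which this sketch does not cover.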