We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated as a transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, with the aim of exploiting versatile linguistic knowledge to generate accurate text. A crucial factor that makes this integration challenging is vocabulary mismatch: the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to be mismatched with the target ASR domain. To overcome this issue, we propose BECTRA, an extension of our previous BERT-CTC that realizes BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model that adopts BERT-CTC as its encoder and trains an ASR-specific decoder using a vocabulary suited to the target task. Building on the combination of the transducer and BERT-CTC, we also propose a novel inference algorithm that takes advantage of both autoregressive and non-autoregressive decoding. Experimental results on several ASR tasks, varying in amount of data, speaking style, and language, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing with the vocabulary mismatch while exploiting BERT knowledge.