Mapping speech and text into a shared representation space is a research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) in new domains. However, speech and text representations differ in length. Previous methods up-sample the text representation to align it with the acoustic modality, but the up-sampled length may not match the actual speech duration. In this paper, we propose a novel representation matching strategy that instead down-samples the acoustic representation to align it with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, which enables domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
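To make the down-sampling idea concrete, the following is a minimal sketch of a generic integrate-and-fire step, not the implementation used in this work. It assumes each acoustic frame already carries a weight in [0, 1] (in a real model these weights would be predicted by a learned layer) and that one token-level vector is emitted each time the accumulated weight reaches a threshold of 1.0. The function name `cif` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List


def cif(frames: List[List[float]], alphas: List[float],
        threshold: float = 1.0) -> List[List[float]]:
    """Down-sample frame vectors into token-level vectors.

    Assumption: every alpha lies in [0, threshold], so a single frame
    can trigger at most one firing.
    """
    if len(frames) != len(alphas):
        raise ValueError("frames and alphas must have the same length")
    dim = len(frames[0]) if frames else 0
    outputs: List[List[float]] = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_vec = [0.0] * dim     # weighted sum of frames since the last firing
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # No boundary yet: keep integrating this frame.
            acc_weight += alpha
            acc_vec = [v + alpha * x for v, x in zip(acc_vec, frame)]
        else:
            # Boundary reached: spend just enough weight to hit the threshold,
            # emit one vector, and carry the remainder into the next token.
            used = threshold - acc_weight
            outputs.append([v + used * x for v, x in zip(acc_vec, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]
    # Any leftover weight below the threshold is dropped in this sketch.
    return outputs


# Four one-dimensional frames whose weights sum to 2.5 yield two token vectors.
tokens = cif([[1.0], [2.0], [4.0], [8.0]], [0.5, 0.75, 0.75, 0.5])
print(tokens)
```

Because the number of emitted vectors is governed by the total accumulated weight rather than by the number of frames, the output length can follow the token count instead of the speech duration, which is the property the abstract relies on for matching the text modality.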