This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that enable faster recognition. We provide a comprehensive comparison of techniques for integrating text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder paired with a smaller decoder can match or surpass the performance of architectures with larger decoders. We further demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are publicly available.