This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (either a Q-Former or our Subsampler-Transformer Encoder). The connector bridges the gap between the speech and text modalities by transforming ASR encoder embeddings into the latent representation space of the MT encoder, and it is the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset, where we investigate the alignment approach in a small-scale ST scenario. While the connector module stays constant and comparatively small (<5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models consistently improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this is a viable and scalable way to train end-to-end ST systems.
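The connector described above can be illustrated with a minimal, framework-free sketch: a subsampler that shortens the ASR frame sequence, followed by a projection into the MT encoder's embedding space. All names, dimensions, and the averaging/linear operations are illustrative assumptions; the actual system uses a trained Q-Former or Subsampler-Transformer Encoder, with only the connector's parameters updated while both foundation models stay frozen.

```python
# Illustrative sketch of a speech-to-text connector (assumed shapes and ops,
# not the paper's actual implementation). A real connector would follow the
# projection with a small trainable Transformer encoder.
import random

ASR_DIM, MT_DIM, STRIDE = 8, 6, 4  # hypothetical embedding sizes / subsampling factor

def subsample(frames, stride=STRIDE):
    """Shorten the ASR frame sequence by averaging non-overlapping windows."""
    out = []
    for i in range(0, len(frames), stride):
        window = frames[i:i + stride]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def project(frames, weights):
    """Linearly map each subsampled frame into the MT encoder's latent space."""
    return [[sum(f[i] * weights[i][j] for i in range(ASR_DIM))
             for j in range(MT_DIM)]
            for f in frames]

random.seed(0)
# Stand-ins for frozen-model activations: 100 ASR encoder frames.
asr_embeddings = [[random.gauss(0, 1) for _ in range(ASR_DIM)] for _ in range(100)]
# The projection weights are the (only) trainable parameters in this sketch.
W = [[random.gauss(0, 0.1) for _ in range(MT_DIM)] for _ in range(ASR_DIM)]

mt_inputs = project(subsample(asr_embeddings), W)
print(len(mt_inputs), len(mt_inputs[0]))  # 25 6: shorter sequence, MT-sized vectors
```

The sketch shows why the approach scales: the foundation models' forward passes are fixed, so training cost depends only on the small connector, regardless of how large the aligned ASR and MT models grow.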