This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (either a Q-Former or our Subsampler-Transformer Encoder). The connector bridges the gap between the speech and text modalities by transforming ASR encoder embeddings into the latent representation space of the MT encoder, and it is the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset, where we investigate the alignment approach in a small-scale ST scenario. While the connector module stays constant and comparatively small (<5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models consistently improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this is a viable and scalable way to train end-to-end ST systems.
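The connector described above can be illustrated with a minimal, framework-free sketch: a subsampler that shortens the ASR frame sequence, followed by a projection into the MT encoder's embedding space. All names, dimensions, and the averaging/linear operations are illustrative assumptions; the actual system uses a trained Q-Former or Subsampler-Transformer Encoder, with only the connector's parameters updated while both foundation models stay frozen.

```python
# Illustrative sketch of a speech-to-text connector (assumed shapes and ops,
# not the paper's actual implementation). A real connector would follow the
# projection with a small trainable Transformer encoder.
import random

ASR_DIM, MT_DIM, STRIDE = 8, 6, 4  # hypothetical embedding sizes / subsampling factor

def subsample(frames, stride=STRIDE):
    """Shorten the ASR frame sequence by averaging non-overlapping windows."""
    out = []
    for i in range(0, len(frames), stride):
        window = frames[i:i + stride]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def project(frames, weights):
    """Linearly map each subsampled frame into the MT encoder's latent space."""
    return [[sum(f[i] * weights[i][j] for i in range(ASR_DIM))
             for j in range(MT_DIM)]
            for f in frames]

random.seed(0)
# Stand-ins for frozen-model activations: 100 ASR encoder frames.
asr_embeddings = [[random.gauss(0, 1) for _ in range(ASR_DIM)] for _ in range(100)]
# The projection weights are the (only) trainable parameters in this sketch.
W = [[random.gauss(0, 0.1) for _ in range(MT_DIM)] for _ in range(ASR_DIM)]

mt_inputs = project(subsample(asr_embeddings), W)
print(len(mt_inputs), len(mt_inputs[0]))  # 25 6: shorter sequence, MT-sized vectors
```

The sketch shows why the approach scales: the foundation models' forward passes are fixed, so training cost depends only on the small connector, regardless of how large the aligned ASR and MT models grow.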