Tuning Large language model for End-to-end Speech Translation

With the emergence of large language models (LLMs), multimodal models based on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM, and SpeechGPT exhibit an impressive ability to comprehend and generate human instructions. However, their performance often falters on complex tasks such as end-to-end speech translation (E2E-ST), a cross-language and cross-modal translation task. In these scenarios, multimodal models lag behind single-modal models. This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and an LLM backend. The training of LST consists of two stages: (1) modality adjustment, where the adapter is tuned to align speech representations with the text embedding space, and (2) downstream task fine-tuning, where both the adapter and the LLM are trained to optimize performance on the E2E-ST task. Experimental results on the MuST-C speech translation benchmark demonstrate that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on the En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state of the art. Additionally, we conduct an in-depth analysis of single-modal model selection and the impact of training strategies, which lays the foundation for future research. We will release our code and models after review.
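The two-stage recipe can be pictured as a schedule of which components receive weight updates. The following is a minimal sketch, not the authors' implementation: the component names come from the abstract, while the `Stage` class, the `build_schedule` and `frozen_components` helpers, and the assumption that the speech frontend stays frozen in both stages are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Component names follow the abstract; everything else here is illustrative.
COMPONENTS = ("speech_frontend", "adapter", "llm_backend")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated in this stage


def build_schedule():
    """Return the two training stages described in the abstract."""
    return [
        # Stage 1: only the adapter is tuned, aligning speech with text embeddings.
        Stage("modality_adjustment", frozenset({"adapter"})),
        # Stage 2: the adapter and the LLM are both trained on the E2E-ST task.
        Stage("downstream_finetuning", frozenset({"adapter", "llm_backend"})),
    ]


def frozen_components(stage):
    """Components kept fixed in a stage.

    Assumption: anything not named as trained in the abstract is frozen,
    which includes the speech frontend in both stages.
    """
    return [c for c in COMPONENTS if c not in stage.trainable]


if __name__ == "__main__":
    for stage in build_schedule():
        print(stage.name, "trains", sorted(stage.trainable),
              "freezes", frozen_components(stage))
```

In a real training script, a schedule like this would typically decide which parameter groups are passed to the optimizer in each stage; the abstract does not specify those details.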

