With the emergence of large language models (LLMs), multimodal models built on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM, and SpeechGPT exhibit an impressive ability to comprehend and generate human instructions. However, their performance often falters on complex tasks such as end-to-end speech translation (E2E-ST), a cross-language and cross-modal translation task. In these scenarios, multimodal models lag behind single-modal models. This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and an LLM backend. Training proceeds in two stages: (1) modality adjustment, where the adapter is tuned to align speech representations with the text embedding space, and (2) downstream task fine-tuning, where both the adapter and the LLM are trained to optimize performance on the E2E-ST task. Experimental results on the MuST-C speech translation benchmark show that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on the En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state of the art. Additionally, we conduct an in-depth analysis of single-modal model selection and the impact of training strategies, which lays the foundation for future research. We will release our code and models after review.
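The two-stage schedule described above can be sketched as a simple table of which components receive updates in each stage. This is a minimal illustration, not the authors' implementation: the component names follow the abstract, while `Stage`, `STAGES`, and `trainable_flags` are hypothetical names introduced here. It also assumes the speech frontend stays frozen throughout, which the abstract does not state explicitly.

```python
from dataclasses import dataclass

# Component names follow the abstract; everything else is hypothetical.
COMPONENTS = ("speech_frontend", "adapter", "llm_backend")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # names of the components updated in this stage


# Assumption: the speech frontend is frozen in both stages. The abstract only
# says which components are trained, not that the frontend is kept fixed.
STAGES = (
    Stage("modality_adjustment", frozenset({"adapter"})),
    Stage("downstream_finetuning", frozenset({"adapter", "llm_backend"})),
)


def trainable_flags(stage):
    """Map each component name to True if it is updated in the given stage."""
    return {name: name in stage.trainable for name in COMPONENTS}


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, trainable_flags(stage))
```

In a real training loop, these flags would typically decide which parameter groups are handed to the optimizer at the start of each stage.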