Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs into the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation. Additionally, we investigate the effects of different parameter-efficient fine-tuning techniques and task formulations. Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data. We also conduct analyses to validate the design choices of our proposed model and offer insights into the integration of LLMs into S2TT.
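As a rough illustration of what "directly consume the encoded speech representation" could look like, the sketch below projects speech-encoder frames into the LLM embedding space and concatenates them with text-prompt embeddings to form the decoder input. The linear projection, the speech-then-prompt ordering, and all names (`project`, `build_decoder_input`, `W`) are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def project(frames: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map each speech frame (dim d_speech) into the LLM embedding space (dim d_model).

    `weight` has shape [d_speech][d_model]; a learned linear layer is assumed here.
    """
    d_model = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_model)]
        for f in frames
    ]


def build_decoder_input(
    speech_frames: List[Vector],
    prompt_embeddings: List[Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Concatenate projected speech frames with text-prompt embeddings.

    The ordering (speech first, then prompt) is an assumption for illustration.
    """
    return project(speech_frames, weight) + prompt_embeddings


# Toy example: 3 speech frames of dim 2, projected to an LLM embedding dim of 4.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]  # hypothetical projection weights
prompt = [[0.0, 0.0, 0.0, 1.0]]  # one hypothetical prompt-token embedding

decoder_input = build_decoder_input(speech, prompt, W)
```

In this sketch the resulting sequence has one vector per speech frame plus one per prompt token, all in the same embedding dimension, so a decoder-only model could attend over speech and text positions uniformly.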