Motivated by the success of large language models (LLMs) across a wide range of tasks, we introduce LLM-ST, a novel and effective speech translation model built on a pre-trained LLM. By integrating the LLM with a speech encoder and applying multi-task instruction tuning, LLM-ST produces accurate timestamped transcriptions and translations, even from long audio inputs. We further find that Chain-of-Thought (CoT) prompting benefits LLM-ST. Through rigorous experiments on English and Chinese datasets, we demonstrate the exceptional performance of LLM-ST, establishing a new benchmark in speech translation. Demo: https://speechtranslation.github.io/llm-st/