Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs remains underexplored, and the "decoder-only" architecture has not been well studied for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method uses Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. We further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model on speech-text paired data alone. Experiments on multilingual speech-to-text translation tasks demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
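
To make the idea of CTC-compressed acoustic features concrete, the following is a minimal sketch of one common form of CTC-based compression: consecutive frames that share the same frame-level CTC label are averaged into a single vector, and runs of blank frames are dropped. The abstract does not specify which compression strategy Speech-LLaMA uses, so this is an illustration under stated assumptions, not the authors' implementation. The function name `ctc_compress`, the blank index `0`, and the toy inputs are all hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_compress(frames, frame_labels, blank=BLANK):
    """Shorten a frame sequence using frame-level CTC labels.

    frames:       list of feature vectors (lists of floats), one per frame
    frame_labels: list of ints, the per-frame CTC label (e.g. argmax of CTC posteriors)

    Consecutive frames with the same non-blank label are averaged into one
    vector; runs of blank frames are discarded.
    """
    if len(frames) != len(frame_labels):
        raise ValueError("frames and frame_labels must have the same length")

    compressed = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if label == blank:
            continue  # drop blank runs entirely
        dim = len(segment[0])
        # element-wise mean over the frames in this run
        compressed.append(
            [sum(frame[d] for frame in segment) / length for d in range(dim)]
        )
    return compressed


# Toy example: 5 frames of 2-dimensional features.
frames = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 9.0], [0.0, 0.0]]
labels = [2, 2, 0, 3, 0]
print(ctc_compress(frames, labels))  # [[2.0, 2.0], [7.0, 9.0]]
```

In this toy run, five input frames shrink to two vectors, which shows how compression of this kind brings the acoustic sequence length closer to that of a text token sequence before it is passed on to the audio encoder and the LLM.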