Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture has likewise received little study for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method uses Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features to the continuous semantic space of the LLM. We further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. Experiments on multilingual speech-to-text translation tasks demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
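As a rough illustration of how Connectionist Temporal Classification (CTC) output can compress acoustic features before they are mapped into an LLM's embedding space, the sketch below averages consecutive frames that share the same CTC argmax label and drops blank frames, then applies a linear projection. The abstract does not specify the compression strategy or the encoder design, so this is only one plausible variant under stated assumptions. The names `ctc_compress`, `project_to_llm_space`, and `blank_id` are hypothetical and do not come from the paper.

```python
import numpy as np


def ctc_compress(features, ctc_logits, blank_id=0):
    """Merge runs of frames with the same CTC argmax label by averaging.

    Assumption (not stated in the abstract): runs labelled as blank are
    discarded, and every other run is reduced to its mean feature vector.
    features:   array of shape (T, D), one acoustic feature vector per frame
    ctc_logits: array of shape (T, V), per-frame CTC scores over V labels
    """
    labels = ctc_logits.argmax(axis=-1)
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != blank_id:
                segments.append(features[start:t].mean(axis=0))
            start = t
    if not segments:
        return np.zeros((0, features.shape[1]), dtype=features.dtype)
    return np.stack(segments)


def project_to_llm_space(compressed, weight, bias):
    """Hypothetical linear map from acoustic dimension D to LLM embedding size E."""
    return compressed @ weight + bias
```

Under these assumptions, the compressed sequence is typically much shorter than the frame sequence, which brings its length closer to that of the text tokens the LLM consumes.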