Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs remains underexplored, and the "decoder-only" architecture has likewise received little study for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method uses Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. We further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model on speech-text paired data alone. Experiments on multilingual speech-to-text translation show a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
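To make the idea of CTC-based compression concrete, the following is a minimal sketch of one common variant: consecutive frames that share the same frame-level CTC label are averaged into a single vector, and runs labeled as blank are dropped. The abstract does not specify the exact compression rule, so the averaging strategy, the function name `ctc_compress`, and the blank index `BLANK = 0` are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_compress(frames: Sequence[Sequence[float]],
                 labels: Sequence[int],
                 blank: int = BLANK) -> List[List[float]]:
    """Average each run of consecutive frames sharing a CTC label; drop blank runs.

    frames: one feature vector per acoustic frame.
    labels: the per-frame argmax label of a CTC classifier (same length as frames).
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    compressed: List[List[float]] = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if label == blank:
            continue  # blank runs carry no token and are skipped
        dim = len(segment[0])
        # Element-wise mean over the frames in this run.
        compressed.append(
            [sum(frame[d] for frame in segment) / length for d in range(dim)]
        )
    return compressed


# Toy input: six 2-dimensional frames with per-frame CTC labels.
frames = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0], [4.0, 4.0], [5.0, 5.0], [7.0, 7.0]]
labels = [5, 5, 0, 7, 0, 7]
short = ctc_compress(frames, labels)
```

In this toy input, six frames shrink to three vectors: the two frames labeled `5` are merged, both blank frames are dropped, and the two `7` frames stay separate because a blank lies between them. In a full system, the shortened sequence would then pass through the audio encoder before being fed to the LLM alongside text embeddings.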