With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. The proposed approach uses a modality adapter, trained on English-language data, to align extracted speech features with the instruction-tuned LLM. Our experiments demonstrate that this method preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for a range of speech understanding applications.
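To make the adapter idea concrete, here is a minimal sketch of one common form such a modality adapter can take: stacking adjacent speech frames to shorten the sequence, then linearly projecting into the LLM's embedding space. The dimensions, stacking factor, and function names below are illustrative assumptions, not details from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not specified by the paper):
SPEECH_DIM = 1024   # output size of a self-supervised speech encoder
LLM_DIM = 4096      # embedding size of an instruction-tuned LLM
STACK = 4           # adjacent frames stacked to reduce sequence length

def modality_adapter(feats, w, b):
    """Stack STACK adjacent frames, then linearly project into LLM space.

    feats: (T, SPEECH_DIM) speech features; returns (T // STACK, LLM_DIM).
    """
    t = (feats.shape[0] // STACK) * STACK          # drop trailing frames
    stacked = feats[:t].reshape(-1, STACK * SPEECH_DIM)
    return stacked @ w + b

# Toy inputs: 100 frames of random "speech features".
feats = rng.standard_normal((100, SPEECH_DIM)).astype(np.float32)
w = rng.standard_normal((STACK * SPEECH_DIM, LLM_DIM)).astype(np.float32) * 0.01
b = np.zeros(LLM_DIM, dtype=np.float32)

out = modality_adapter(feats, w, b)
print(out.shape)  # (25, 4096): 4x shorter sequence in the LLM's embedding space
```

In practice the projection weights would be learned (and the adapter may be deeper), with the speech encoder and LLM typically kept frozen; this sketch only shows the shape transformation that lets speech features enter the LLM as if they were token embeddings.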