In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder, which converts speech into token representations that the LLM can interpret. Using a dataset of paired speech-text data, the overall system is trained to generate consistent responses to prompts that carry the same semantic information, regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, so speech summarization reduces to prompting the LLM. Unlike prior approaches, our method can summarize spoken content from arbitrary domains, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach outperforms a cascade baseline of speech recognition followed by LLM text processing.
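The modality-invariance idea can be illustrated with a minimal sketch. The abstract does not state the exact training loss, so the code below assumes one plausible reading: the LLM's response to the text version of an utterance is treated as the target, and the speech-conditioned system is penalized by the negative log-likelihood of those target tokens. The names `build_llm_input` and `response_matching_loss` are hypothetical, and the toy lists stand in for real embeddings and logits.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_llm_input(
    prompt_embeds: Sequence[Sequence[float]],
    content_embeds: Sequence[Sequence[float]],
) -> List[Sequence[float]]:
    """Concatenate prompt embeddings with content embeddings.

    The content may come from the LLM's text embedding table or from the
    audio encoder (assumption: both live in the same embedding space).
    """
    return list(prompt_embeds) + list(content_embeds)


def response_matching_loss(
    speech_logits: Sequence[Sequence[float]],
    target_token_ids: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the target response tokens.

    speech_logits[t] are the (hypothetical) LLM logits at response step t
    when the content is speech; target_token_ids[t] is the token the LLM
    produced at step t when given the paired text instead.
    """
    if len(speech_logits) != len(target_token_ids):
        raise ValueError("logits and targets must have the same length")
    if not target_token_ids:
        raise ValueError("target response must not be empty")
    nll = 0.0
    for logits, token_id in zip(speech_logits, target_token_ids):
        nll -= math.log(softmax(logits)[token_id])
    return nll / len(target_token_ids)


if __name__ == "__main__":
    # Toy example: a 4-token vocabulary and a 2-token target response.
    uninformed = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print(response_matching_loss(uninformed, [1, 3]))
```

Under this reading, the loss falls toward zero as the speech-conditioned predictions agree with the text-conditioned response. Summarization is then a matter of choosing the prompt passed to `build_llm_input`.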