While Large Language Models (LLMs) perform well across many domains and tasks, they still handle multimodal inputs poorly, especially in Spoken Question Answering (SQA), which requires precise alignment and deep interaction between speech and text features. To address SQA with LLMs, we first curate LibriSQA, a free-form, open-ended dataset built from Librispeech. Part I uses natural conversational formats, and Part II consists of multiple-choice questions followed by answers and analytical segments. Together, the two parts contain 107k SQA pairs covering a variety of topics. Given the scarcity of existing speech-text LLMs, we propose a lightweight, end-to-end framework for the SQA task on LibriSQA and obtain significant results. By reformulating automatic speech recognition (ASR) in the SQA format, we further show that the framework can handle ASR tasks. Our empirical findings support the ability of LLMs to align and comprehend multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.