Although Large Language Models (LLMs) perform well across many domains and tasks, they remain weak at multimodal functions, especially Spoken Question Answering (SQA), which requires precise alignment and deep interaction between speech and text features. To address SQA with LLMs, we first curate LibriSQA, a free-form, open-ended dataset built from Librispeech. Part I uses natural conversational formats, and Part II contains multiple-choice questions followed by answers and analytical segments. Together, the two parts include 107k SQA pairs covering various topics. Given the scarcity of existing speech-text LLMs, we propose a lightweight, end-to-end framework for the SQA task on LibriSQA and obtain significant results. By reformulating automatic speech recognition (ASR) in the SQA format, we further show that the framework can handle ASR tasks. Our empirical findings support the ability of LLMs to align and comprehend multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.