Large Language Models (LLMs) perform well across many domains and tasks, yet existing LLMs remain weak at multimodal functions, especially the Spoken Question Answering (SQA) task, which requires precise alignment and deep interaction between speech and text features. To address SQA with LLMs, we first curate LibriSQA, a free-form, open-ended dataset derived from LibriSpeech. Part I uses natural conversational formats, and Part II contains multiple-choice questions followed by answers and analytical segments. Together, the two parts include 107k SQA pairs covering various topics. Given the scarcity of existing speech-text LLMs, we propose a lightweight, end-to-end framework that performs the SQA task on LibriSQA and achieves significant results. By reformulating automatic speech recognition (ASR) in the SQA format, we further demonstrate that the framework can handle ASR tasks. Our empirical findings support the ability of LLMs to align and comprehend multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.