Existing Scholarly Question Answering (QA) methods typically target homogeneous data sources, relying solely on either text or Knowledge Graphs (KGs). However, scholarly information often spans heterogeneous sources, which calls for QA systems that can integrate information across them. To address this challenge, we introduce Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale QA dataset designed to facilitate answering questions that require both text and KG facts. The dataset consists of 10.5K question-answer pairs generated by a large language model, leveraging the KGs DBLP and SemOpenAlex alongside corresponding text from Wikipedia. In addition, we propose a baseline hybrid QA model based on retrieval-augmented generation (RAG), which achieves an exact match score of 69.65 on the Hybrid-SQuAD test set.