Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new reading comprehension dataset of 2,937 QA pairs that challenges LLMs to demonstrate deep understanding of scientific articles. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers from paper authors, ensuring a thorough examination of the literature. We enhance the dataset's quality through a process that carefully filters out low-quality questions, decontextualizes the content, tracks the source document across versions, and incorporates a bibliography for multi-document question answering. Questions in SciDQA require reasoning over figures, tables, equations, appendices, and supplementary materials, as well as across multiple documents. We evaluate several open-source and proprietary LLMs in various configurations to assess their ability to generate relevant and factual responses. Our comprehensive evaluation, based on surface-level similarity metrics and LLM judgments, reveals notable performance discrepancies. SciDQA is a rigorously curated, naturally derived scientific QA dataset designed to facilitate research on complex scientific text understanding.