Large Language Models (LLMs) have excelled in multi-hop question answering (M-QA) due to their advanced reasoning abilities. However, the impact of inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, in which different reasoning structures are entangled, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding motivates studying the role of textual structure alongside semantics.
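To make the reasoning-graph notion concrete, here is a minimal sketch, assuming a hypothetical representation (not the actual GRS-QA schema): nodes hold textual contexts, directed edges encode the logical flow between reasoning steps, and a topological sort recovers the order in which a 2-hop question should be resolved.

```python
from collections import deque

# Hypothetical reasoning graph for a 2-hop question; node texts and the
# dict layout are illustrative assumptions, not the dataset's actual format.
reasoning_graph = {
    "nodes": {
        "n1": "Paris is the capital of France.",
        "n2": "The Louvre is located in Paris.",
    },
    "edges": [("n1", "n2")],  # logical flow: n1 must be resolved before n2
}

def reasoning_order(graph):
    """Return node ids in a topological order of the logical flow (Kahn's algorithm)."""
    indeg = {n: 0 for n in graph["nodes"]}
    succ = {n: [] for n in graph["nodes"]}
    for src, dst in graph["edges"]:
        indeg[dst] += 1
        succ[src].append(dst)
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

print(reasoning_order(reasoning_graph))  # ['n1', 'n2']
```

Different reasoning structures (chains, trees, and graphs with merging branches) would simply correspond to different edge sets under this representation, which is what allows structure-wise comparison of model performance.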