Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment, i.e., the assumption that answers can be directly extracted or generated from documents in the corpus. Some questions, however, require inference: deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA, a new task that challenges models to infer answers from answer-supporting passages that provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability assessment and human verification. Through a comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle on inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning yields inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move toward understanding and reasoning from indirect textual evidence.