Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions: queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier that identifies underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis: underspecified questions are rewritten into fully specified variants while the gold answers are held fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
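A minimal sketch of how such a classification step could be wired up. The prompt wording, label set, and helper names below are illustrative assumptions for exposition, not the paper's exact protocol; the LLM call itself is left abstract.

```python
# Sketch of an LLM-based underspecification classifier (assumed design,
# not the paper's exact prompt or labels). The model is queried once per
# question and asked to emit one of two labels, which we then parse.

CLASSIFY_PROMPT = (
    "Does the following question have a unique interpretation without any "
    "additional context? Answer with exactly one word: "
    "SPECIFIED or UNDERSPECIFIED.\n\n"
    "Question: {question}"
)


def build_prompt(question: str) -> str:
    """Fill the classification prompt template for one benchmark question."""
    return CLASSIFY_PROMPT.format(question=question)


def parse_label(llm_response: str) -> bool:
    """Return True if the model's response labels the question underspecified."""
    return "UNDERSPECIFIED" in llm_response.upper()


def underspecified_rate(labels: list[bool]) -> float:
    """Fraction of questions flagged as underspecified (e.g. 0.16 to 0.50+)."""
    return sum(labels) / len(labels)
```

In use, `build_prompt` would be sent to the chosen LLM for each benchmark question, `parse_label` applied to each response, and `underspecified_rate` computed per dataset; the same flagged subset then feeds the rewriting experiment, where each underspecified question is rewritten into a fully specified variant with its gold answer unchanged.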