Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions: queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier that identifies underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis: underspecified questions are rewritten into fully specified variants while the gold answers are held fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
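The classification step described above could be sketched as follows. This is a minimal illustration, not the authors' actual prompt or pipeline: the prompt wording, the label names, and the `ask_llm` callable are all assumptions.

```python
# Hypothetical sketch of an LLM-based underspecification classifier.
# The prompt text and labels below are illustrative assumptions, not
# the prompt used in the paper.

PROMPT_TEMPLATE = (
    "A question is underspecified if its interpretation cannot be "
    "uniquely determined without additional context.\n"
    "Question: {question}\n"
    "Reply with exactly one label: UNDERSPECIFIED or FULLY_SPECIFIED."
)


def build_prompt(question: str) -> str:
    """Fill the classification prompt for one benchmark question."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_label(reply: str) -> bool:
    """Map the model's reply to True iff it labels the question underspecified."""
    return reply.strip().upper().startswith("UNDERSPECIFIED")


def classify(question: str, ask_llm) -> bool:
    """`ask_llm` is any callable str -> str wrapping an LLM endpoint."""
    return parse_label(ask_llm(build_prompt(question)))
```

Applied over a dataset, the fraction of `True` outputs would give the per-benchmark underspecification rate reported in the abstract (16% to over 50%).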