Evaluating Open-Domain Question Answering in the Era of Large Language Models

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly 60%, putting it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state of the art on NQ-open. We also find that more than 50% of lexical matching failures are attributable to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
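To make the failure mode concrete, the following is a minimal sketch of exact-match lexical evaluation. It assumes a common normalization recipe (lowercasing, stripping punctuation and English articles, collapsing whitespace), which may differ from the exact procedure used in this work. The function names `normalize` and `exact_match` and the example question are hypothetical illustrations, not drawn from NQ-open.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """Return True if the normalized candidate equals any normalized gold answer."""
    cand = normalize(candidate)
    return any(cand == normalize(gold) for gold in gold_answers)


# Hypothetical example: "Who wrote Hamlet?" with a single gold answer.
gold = ["William Shakespeare"]

short_hit = exact_match("william shakespeare.", gold)
short_miss = exact_match("Shakespeare", gold)
long_miss = exact_match("The play was written by William Shakespeare.", gold)

print(short_hit, short_miss, long_miss)  # True False False
```

Both rejected candidates are plausible answers: one is a semantically equivalent short form missing from the gold list, and the other is a longer generative-style answer that contains the gold string but does not equal it.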
