Evaluating Open-Domain Question Answering in the Era of Large Language Models

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures because candidate answers become longer, making them even harder to match against the gold answers. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly 60%, putting it on par with existing top models, and the InstructGPT (few-shot) model achieves a new state of the art on NQ-open. We also find that more than 50% of lexical matching failures are attributable to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
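To make the failure mode concrete, the following is a minimal sketch of lexical (exact) matching. It assumes the normalization commonly used in QA evaluation scripts (lowercasing, stripping punctuation and English articles); the abstract does not specify the exact procedure. The function names `normalize` and `exact_match` and the example question are hypothetical and are not drawn from NQ-open.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors a commonly used QA normalization; the paper's
    exact procedure may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """Return True if the normalized candidate equals any normalized gold answer."""
    cand = normalize(candidate)
    return any(cand == normalize(gold) for gold in gold_answers)


# Illustrative example (not from NQ-open): "Who wrote To Kill a Mockingbird?"
gold = ["Harper Lee"]

short_answer = "Harper Lee"
equivalent_answer = "Nelle Harper Lee"  # same person, fuller name
long_answer = "To Kill a Mockingbird was written by Harper Lee."  # LLM-style answer

print(exact_match(short_answer, gold))       # matches the gold answer
print(exact_match(equivalent_answer, gold))  # correct, but rejected
print(exact_match(long_answer, gold))        # correct, but rejected
```

In this sketch, both the semantically equivalent answer and the long-form answer are correct yet scored as wrong, which is the kind of underestimation the manual evaluation is meant to expose.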
