Evaluating Open-Domain Question Answering in the Era of Large Language Models

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures because candidate answers become longer, making matching against the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly 60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state of the art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
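To make the failure mode concrete, the following is a minimal sketch of lexical matching, not the evaluation code used in the paper. It assumes a common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the function names `normalize`, `exact_match`, and `regex_match`, as well as the example question, gold answer, and pattern, are hypothetical and chosen only for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """True if the normalized candidate equals any normalized gold answer."""
    return any(normalize(candidate) == normalize(g) for g in gold_answers)


def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """True if any gold pattern occurs anywhere in the candidate answer."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in gold_patterns)


# Hypothetical example: "Who wrote Gulliver's Travels?"
gold_answers = ["Jonathan Swift"]
gold_patterns = [r"Jonathan\s+Swift"]

short_answer = "Jonathan Swift"
long_answer = "The book was written by Jonathan Swift in 1726."
abbreviated = "J. Swift"

print(exact_match(short_answer, gold_answers))   # short extractive-style answer
print(exact_match(long_answer, gold_answers))    # long generative-style answer
print(regex_match(long_answer, gold_patterns))   # pattern search inside long answer
print(regex_match(abbreviated, gold_patterns))   # equivalent answer, different surface form
```

In this sketch, exact matching accepts the short answer but rejects the longer sentence that contains the same correct entity. Pattern search recovers the long answer, yet it still rejects the abbreviated form, which illustrates the kind of unnecessary strictness described above.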
