Question answering (QA) can only make progress if we can tell whether an answer is correct. For many of the most challenging and interesting QA examples, however, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for the more verbose, free-form answers produced by large language models (LLMs). There are two challenges: a lack of data and models that are too big. LLM-based scorers can correlate better with human judges, but they have only been tested on a limited number of QA datasets, and even when such data are available, updating the model is constrained because LLMs are large and often expensive. We rectify both of these issues by providing clear and consistent guidelines for evaluating AE in machine QA, adopted from professional human QA contests. We also introduce a combination of standard evaluation and a more efficient, robust, and lightweight discriminative AE classifier-based matching method (CFMatch, smaller than 1 MB), trained and validated to evaluate answer correctness more accurately in accordance with the adopted expert AE rules, which are more aligned with human judgments.
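As a rough illustration of what "a combination of standard evaluation and classifier-based matching" could look like, the sketch below pairs two common string metrics (normalized exact match and token-level F1) with an optional learned classifier. It is not the CFMatch implementation: the function names, the `classifier` callable, the normalization steps, and the 0.5 F1 threshold are all assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate, reference):
    """Standard evaluation: the normalized strings are identical."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate, reference):
    """Standard evaluation: harmonic mean of token precision and recall."""
    cand = normalize(candidate).split()
    ref = normalize(reference).split()
    if not cand or not ref:
        # Both empty counts as a match; exactly one empty does not.
        return float(cand == ref)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_equivalent(candidate, reference, classifier=None, f1_threshold=0.5):
    """Hypothetical combination: exact match first, then a classifier if given.

    `classifier` stands in for a small trained AE model; it is assumed to be a
    callable taking (candidate, reference) and returning a truthy value when
    it judges the answers equivalent. Without one, fall back to an F1
    threshold (an assumed value, not taken from the paper).
    """
    if exact_match(candidate, reference):
        return True
    if classifier is not None:
        return bool(classifier(candidate, reference))
    return token_f1(candidate, reference) >= f1_threshold
```

In this arrangement the cheap string check settles the easy cases, and the classifier is consulted only for answers whose surface form differs from the reference, which is where string metrics tend to disagree with human judges.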