Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for the more verbose, free-form answers produced by large language models (LLMs). There are two challenges: a lack of data and models that are too big. LLM-based scorers can correlate better with human judges, but this task has been tested only on limited QA datasets, and even when data are available, updating and deploying such scorers is constrained because LLMs are large and often expensive. We rectify both issues by providing clear and consistent guidelines for evaluating AE in machine QA, adapted from professional human QA contests. We also introduce a method that combines standard evaluation with a more efficient, robust, and lightweight discriminative AE classifier-based matching approach (CFMatch, smaller than 1 MB), trained and validated to evaluate answer correctness more accurately in accordance with the adopted expert AE rules, which align better with human judgments.
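To make the two-stage idea concrete, below is a minimal sketch of combining standard exact-match evaluation with a tiny discriminative AE classifier as a fallback. The feature set, toy training pairs, and names (`ae_features`, `cf_match`) are illustrative assumptions for exposition, not the actual CFMatch implementation.

```python
# Hypothetical sketch of a lightweight answer-equivalence (AE) classifier.
# Features and training pairs are illustrative, not the paper's design.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ae_features(candidate: str, reference: str) -> list:
    """Cheap lexical features comparing a candidate answer to a reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = len(set(cand) & set(ref))
    return [
        float(candidate.lower() == reference.lower()),  # exact string match
        overlap / max(len(set(ref)), 1),                # reference recall
        overlap / max(len(set(cand)), 1),               # candidate precision
        len(cand) / max(len(ref), 1),                   # length ratio
    ]

# Toy labeled pairs: (candidate answer, reference answer, equivalent?)
pairs = [
    ("Barack Obama", "Obama", 1),
    ("the 44th president, Barack Obama", "Obama", 1),
    ("George Bush", "Obama", 0),
    ("Paris", "Paris, France", 1),
    ("London", "Paris", 0),
]
X = np.array([ae_features(c, r) for c, r, _ in pairs])
y = np.array([label for _, _, label in pairs])

# A linear model over a handful of features stays far under 1 MB.
clf = LogisticRegression().fit(X, y)

def cf_match(candidate: str, reference: str) -> bool:
    """Exact match first; fall back to the classifier when it fails."""
    if candidate.strip().lower() == reference.strip().lower():
        return True
    return bool(clf.predict([ae_features(candidate, reference)])[0])
```

The design choice this illustrates is that a discriminative classifier only needs to adjudicate the hard cases exact match misses, which is why such a scorer can stay small and cheap relative to an LLM-based judge.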