Progress in question answering (QA) depends on knowing whether an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for the more verbose, free-form answers produced by large language models (LLMs). There are two challenges: a lack of data and the size of the models. LLM-based scorers can correlate better with human judges, but this task has only been tested on limited QA datasets, and even when data are available, updating the model is limited because LLMs are large and often expensive. We rectify both of these issues by providing clear and consistent guidelines for evaluating AE in machine QA, adopted from professional human QA contests. We also introduce a combination of standard evaluation and a more efficient, robust, and lightweight discriminative AE classifier-based matching method (CFMatch, smaller than 1 MB), trained and validated to evaluate answer correctness more accurately in accordance with the adopted expert AE rules, which are more aligned with human judgments.