Current automated fact-checking (AFC) approaches commonly evaluate evidence either implicitly, via the predicted verdicts, or by comparing retrieved evidence against a predefined closed knowledge source such as Wikipedia. However, these methods suffer from limitations stemming from their reliance on evaluation metrics developed for other purposes and from the constraints imposed by closed knowledge sources. Recent advances in natural language generation (NLG) evaluation offer new possibilities for evidence assessment. In this work, we introduce Ev2R, an evaluation framework for AFC that comprises three types of approaches for evidence evaluation: reference-based, proxy-reference, and reference-less. We evaluate their effectiveness through agreement with human ratings and adversarial tests, and demonstrate that prompt-based scorers, particularly those leveraging LLMs and reference evidence, outperform traditional evaluation approaches.