Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that the approximately 16\% of examples that are ambiguous or incorrectly labeled substantially influence model rankings. Neglecting this issue may lead to misleading conclusions in comparative evaluations, and we suggest a systematic pipeline built on LLM-as-a-judge to help identify these issues at scale. Second, we find that frontier LLMs with few-shot in-context examples, often overlooked in prior work, achieve top-tier performance. We therefore recommend that future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities on such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers.
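The LLM-as-a-judge pipeline mentioned above can be illustrated with a minimal sketch. All names here (`Example`, `flag_label_issues`, `stub_judge`) are hypothetical and not taken from the released code; a real pipeline would replace the rule-based stub with an actual LLM call that re-verifies each gold label.

```python
# Minimal sketch (assumptions, not the paper's implementation): use a judge
# model to flag ambiguous or mislabeled fact-checking examples for review.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    claim: str
    evidence: str
    label: str  # gold label, e.g. "SUPPORTED" or "REFUTED"


def flag_label_issues(dataset: List[Example],
                      judge: Callable[[Example], str]) -> List[Example]:
    """Return examples whose judge verdict is "AMBIGUOUS" or disagrees with
    the gold label; these become candidates for manual re-annotation."""
    flagged = []
    for ex in dataset:
        verdict = judge(ex)
        if verdict == "AMBIGUOUS" or verdict != ex.label:
            flagged.append(ex)
    return flagged


def stub_judge(ex: Example) -> str:
    # Placeholder for an LLM call: a real judge would be prompted with the
    # claim, the evidence, and the gold label, and asked to re-verify.
    if "unclear" in ex.evidence:
        return "AMBIGUOUS"
    return "SUPPORTED" if ex.claim in ex.evidence else "REFUTED"


dataset = [
    Example("Paris is in France.", "Paris is in France.", "SUPPORTED"),
    Example("The sky is green.", "Observations are unclear.", "SUPPORTED"),
    Example("Water boils at 50C.", "Water boils at 100C.", "SUPPORTED"),
]
flagged = flag_label_issues(dataset, stub_judge)
print(len(flagged))  # the ambiguous and the mislabeled example are flagged
```

Disagreements between judge and gold label do not prove the label wrong; as in the paper's setting, flagged instances are surfaced at scale and then resolved by inspection.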