Large Language Models (LLMs) have significantly advanced fact-checking research. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate justification production and to uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, the framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.