The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static and therefore subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects a system's actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 24,000 real-world claims from 108 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. New claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to updating VeriTaS going forward, establishing a leakage-resistant benchmark that supports meaningful AFC evaluation in the era of rapidly evolving foundation models. We will make the code and data publicly available.