The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static and therefore subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects a system's actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 25,000 real-world claims from 104 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. New claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to updating VeriTaS on an ongoing basis, establishing a leakage-resistant benchmark that supports meaningful AFC evaluation in the era of rapidly evolving foundation models. The code and data are publicly available at https://veritas.mai.informatik.tu-darmstadt.de.