The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking from simple classification into complex reasoning tasks. Evaluation frameworks, however, have not kept pace: current benchmarks are static, leaving them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world "fog of war" in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason over evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation protocol, Classification Mode for final verdicts and Inference Mode for evidence-based reasoning, together with a component that explicitly monitors BDC. Experiments with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis reveals a significant "reasoning gap": capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect that traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.