AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X

Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer is a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes. It wrote 1,614 notes on 1,597 tweets, which we compare against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Direct comparison of note-level platform outcomes is complicated by differences in submission timing and rating exposure between LLM and human notes. We therefore pursue two complementary strategies: a rating-level analysis that models individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. The rating-level analysis shows that LLM notes receive more positive ratings than human notes across raters with different political viewpoints, suggesting that LLM-written notes can achieve cross-partisan consensus. The note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. Our findings demonstrate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while highlighting that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.
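
The exposure-equalizing idea behind the note-level analysis can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the tuple layout of a rating, the "llm"/"human" labels, and the numeric coding of helpfulness (for example 1.0 for helpful, 0.5 for somewhat helpful, 0.0 for not helpful) are all assumptions made for illustration, and a plain mean stands in for whatever helpfulness score the study actually uses. The sketch keeps, for each post, only the raters who rated every note on that post, then averages their ratings by note type.

```python
from collections import defaultdict
from statistics import mean


def matched_rater_scores(ratings, notes_by_post):
    """Compare note types using only raters who rated every note on a post.

    ratings: iterable of (rater_id, post_id, note_id, score) tuples, where
        score is a hypothetical numeric helpfulness coding.
    notes_by_post: dict mapping post_id -> {note_id: "llm" or "human"}.
    Returns: dict mapping post_id -> {"llm": mean, "human": mean, "n_raters": int}.
    """
    # rated[post_id][rater_id] -> {note_id: score}
    rated = defaultdict(lambda: defaultdict(dict))
    for rater_id, post_id, note_id, score in ratings:
        rated[post_id][rater_id][note_id] = score

    result = {}
    for post_id, notes in notes_by_post.items():
        # Equalize exposure: drop raters who missed any note on this post.
        matched = [
            scores for scores in rated[post_id].values()
            if all(note_id in scores for note_id in notes)
        ]
        if not matched:
            continue  # no rater saw every note, so the post is not comparable

        by_type = defaultdict(list)
        for scores in matched:
            for note_id, note_type in notes.items():
                by_type[note_type].append(scores[note_id])

        summary = {note_type: mean(vals) for note_type, vals in by_type.items()}
        summary["n_raters"] = len(matched)
        result[post_id] = summary
    return result
```

Restricting to matched raters means that every note on a post is judged by the same set of people, so a difference in the averages cannot come from one note type simply having been shown to a different or larger audience.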
