Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evaluate in realistic information-seeking settings. We study this problem in the Chinese web ecosystem by constructing a query-based fact-checking dataset from real Chinese search logs and comparing nine systems spanning traditional search engines, standalone large language models, and search-integrated AI Overviews. Focusing on Chinese-language factual yes/no questions, we evaluate whether each system gives a correct, incorrect, or uncertain decision relative to evidence-derived ground truth. We find that systems are similarly accurate when they provide definitive answers but differ sharply in how often they do so. Conditional accuracy, measured over definitive answers only, ranges from 73.2% to 78.9%, yet search engines answer definitively on over 83% of queries, while Qwen-Max does so on fewer than half. We also find a consistent polarity gap: all systems perform better on yes-labeled queries than on no-labeled queries. Finally, we use Baidu Index data to identify Chinese provinces with higher health-related search attention, which may indicate greater potential exposure to misinformation. Overall, our results show that reliability depends not only on whether systems are correct when they answer, but also on how often they answer, how they handle negative claims, and where heightened information demand may increase exposure risk.
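The distinction between accuracy when answering and how often a system answers can be made concrete with a small sketch. The abstract does not give formulas, so the definitions below are assumptions: the definitive-answer rate is taken as the share of queries with a correct or incorrect (that is, non-uncertain) decision, and conditional accuracy as the share of correct decisions among those definitive answers. The function name `reliability_metrics` and the label strings are hypothetical, and the sample data is illustrative only, not drawn from the study.

```python
from collections import Counter


def reliability_metrics(decisions):
    """Summarize per-query decisions for one system.

    `decisions` is an iterable of the strings 'correct', 'incorrect',
    or 'uncertain' (hypothetical label names, assumed for illustration).
    """
    counts = Counter(decisions)
    total = sum(counts.values())
    # A definitive answer is any decision that is not 'uncertain'.
    definitive = counts["correct"] + counts["incorrect"]
    return {
        # Share of queries on which the system commits to an answer.
        "definitive_rate": definitive / total if total else 0.0,
        # Accuracy measured only over the definitive answers.
        "conditional_accuracy": counts["correct"] / definitive if definitive else 0.0,
        # Accuracy over all queries, counting 'uncertain' as not correct.
        "overall_accuracy": counts["correct"] / total if total else 0.0,
    }


# Illustrative data: 6 correct, 2 incorrect, 2 uncertain decisions.
sample = ["correct"] * 6 + ["incorrect"] * 2 + ["uncertain"] * 2
metrics = reliability_metrics(sample)
```

Under these assumed definitions, two systems with the same conditional accuracy can still differ widely in overall accuracy if one of them returns uncertain decisions far more often.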