Evaluating Commercial AI Chatbots as News Intermediaries

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and the cohort as a whole loses 16-17%. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere), and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval failures, not reasoning failures, drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is landing on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these findings suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to the imperfect queries that real users pose.
