In Generative AI we Trust: Can Chatbots Effectively Verify Political Information?

This article presents a comparative analysis of the ability of two large language model (LLM)-based chatbots, ChatGPT and Bing Chat (recently rebranded as Microsoft Copilot), to detect the veracity of political information. We use AI auditing methodology to investigate how the chatbots evaluate true, false, and borderline statements on five topics: COVID-19, Russian aggression against Ukraine, the Holocaust, climate change, and LGBTQ+-related debates. We compare how the chatbots perform in high- and low-resource languages by using prompts in English, Russian, and Ukrainian. Furthermore, we use definition-oriented prompts to explore the ability of the chatbots to evaluate statements according to the political communication concepts of disinformation, misinformation, and conspiracy theory. We also systematically test how such evaluations are influenced by source bias, which we model by attributing specific claims to various political and social actors. The results show high performance of ChatGPT on the baseline veracity evaluation task, with 72 percent of the cases evaluated correctly on average across languages without pre-training. Bing Chat performed worse, with 67 percent accuracy. We observe significant disparities in how the chatbots evaluate prompts in high- and low-resource languages and in how they adapt their evaluations to political communication concepts, with ChatGPT providing more nuanced outputs than Bing Chat. Finally, we find that for some veracity detection-related tasks, the performance of the chatbots varied depending on the topic of the statement or the source to which it was attributed. These findings highlight the potential of LLM-based chatbots for tackling different forms of false information in online environments, but they also point to substantial variation in how that potential is realized, depending on specific factors such as the language of the prompt or the topic.
