Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification

The reasoning capabilities of LLMs are currently hotly debated. We examine the issue from the perspective of claim/rumour verification. We propose the first logical reasoning framework designed to break down any claim or rumour paired with evidence into the atomic reasoning steps necessary for verification. Based on our framework, we curate two annotated collections of such claim/evidence pairs: a synthetic dataset from Wikipedia and a real-world set stemming from rumours circulating on Twitter. We use them to evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4 (hereinafter referred to as ChatGPT) within the context of our framework, providing a thorough analysis. Our results show that ChatGPT struggles with abductive reasoning, although this can be somewhat mitigated by using manual Chain of Thought (CoT) prompting rather than Zero-Shot (ZS) or ZS CoT approaches. Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning, and that LLMs need to be evaluated more rigorously in order to distinguish between hype and actual capabilities, especially in high-stakes real-world tasks such as claim verification.
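The three prompting styles named in the abstract (ZS, ZS CoT and manual CoT) can be illustrated with a minimal sketch. The abstract does not give the prompts used in the study, so the function name `build_prompt`, the question wording and the demonstration fields below are all hypothetical; only the "Let's think step by step" trigger is the phrase commonly used for zero-shot CoT.

```python
def build_prompt(claim, evidence, mode="zs", demonstrations=None):
    """Build a claim-verification prompt in one of three styles.

    All wording here is illustrative, not the prompts used in the paper.
    """
    task = (
        f"Evidence: {evidence}\n"
        f"Claim: {claim}\n"
        "Is the claim supported by the evidence? Answer yes or no."
    )
    if mode == "zs":
        # Zero-shot: the bare task, no reasoning cue and no examples.
        return task
    if mode == "zs_cot":
        # Zero-shot CoT: same task plus a generic reasoning trigger.
        return task + "\nLet's think step by step."
    if mode == "manual_cot":
        # Manual CoT: hand-written worked examples precede the task.
        if not demonstrations:
            raise ValueError("manual CoT needs at least one demonstration")
        shots = "\n\n".join(
            f"Evidence: {d['evidence']}\n"
            f"Claim: {d['claim']}\n"
            f"Reasoning: {d['reasoning']}\n"
            f"Answer: {d['answer']}"
            for d in demonstrations
        )
        return shots + "\n\n" + task
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical demonstration written by hand for the manual CoT setting.
demo = [{
    "evidence": "The bridge was closed on Monday for repairs.",
    "claim": "The bridge was open on Monday.",
    "reasoning": "The evidence says the bridge was closed, which contradicts the claim.",
    "answer": "no",
}]
prompt = build_prompt("The river flooded.", "Heavy rain fell upstream.",
                      mode="manual_cot", demonstrations=demo)
```

The only difference between the settings is what surrounds the task text: nothing (ZS), a generic reasoning cue (ZS CoT), or worked reasoning examples written by a person (manual CoT).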


