Large language models (LLMs) such as ChatGPT and GPT-4 have shown impressive performance in complex reasoning tasks. However, it is difficult to know whether the models are reasoning based on deep understandings of truth and logic, or leveraging their memorized patterns in a relatively superficial way. In this work, we explore testing LLMs' reasoning by engaging with them in a debate-like conversation, where given a question, the LLM and the user need to discuss to make the correct decision starting from opposing arguments. Upon mitigating the Clever Hans effect, our task requires the LLM to not only achieve the correct answer on its own, but also be able to hold and defend its belief instead of blindly believing or getting misled by the user's (invalid) arguments and critiques, thus testing in greater depth whether the LLM grasps the essence of the reasoning required to solve the problem. Across a range of complex reasoning benchmarks spanning math, commonsense, logic and BIG-Bench tasks, we find that despite their impressive performance as reported in existing work on generating correct step-by-step solutions in the beginning, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples when challenged by oftentimes absurdly invalid arguments. Our work points to danger zones of model alignment, and also suggests more careful treatments and interpretations of the recent findings that LLMs can improve their responses based on feedback.
翻译:以ChatGPT和GPT-4为代表的大语言模型在复杂推理任务中展现出卓越性能。然而,尚难判断这些模型是基于对真理与逻辑的深层理解进行推理,抑或仅以相对浅层的方式利用其记忆模式。本研究通过引导大语言模型参与辩论式对话来探索其推理能力——给定问题后,模型与用户需从对立论点出发展开讨论以达成正确决策。在缓解Clever Hans效应后,本任务要求模型不仅独立得出正确答案,还需要坚持并捍卫自身信念,而非盲目相信或受用户(无效)论点与批评的误导,从而更深入地检验模型是否真正掌握解决该问题所需的推理本质。在涵盖数学、常识、逻辑及BIG-Bench任务的一系列复杂推理基准测试中我们发现:尽管现有研究报道大语言模型(如ChatGPT)最初能生成正确的逐步推导方案并展现惊人性能,但当面对常显荒谬的无效论点挑战时,其在相当比例案例中无法坚持对真理的信念。本研究揭示了模型对齐的危险区域,同时建议对"大语言模型能基于反馈改进应答"这一最新发现需采取更审慎的对待方式与解读视角。