We explore testing the reasoning ability of large language models (LLMs), such as ChatGPT, by engaging them in a debate-like conversation that probes more deeply into their understanding of the subject. Specifically, we formulate a new task: given a question, the LLM generates a correct solution while the user initially believes in a wrong solution, and the two must discuss the problem to reach the correct decision through dialogue. Such a setting requires the LLM not only to arrive at the correct answer on its own (which could be done by shallow memorization), but also to defend the truth rather than blindly believing or being misled by the user's (invalid) arguments and critiques, thus testing in greater depth whether the LLM grasps the essence of the reasoning required to solve the problem. To automate this evaluation framework and save human labor, we simulate the user with another LLM conditioned on a synthesized wrong solution. Across a range of complex reasoning benchmarks spanning math, commonsense, logic, and tasks from BIG-Bench, we find that despite being able to generate correct step-by-step solutions in the beginning, ChatGPT cannot maintain its belief in the truth for a significant portion of examples when challenged by often absurdly invalid arguments. Our work reveals weaknesses of LLMs not captured by conventional benchmarking, and also points to danger zones in aligning models with human feedback.
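The debate-style evaluation described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_debate`, `DebateResult`, `model_reply`, `user_reply`, and `extract_answer` are hypothetical, the turn limit of 3 is arbitrary, and the two toy "models" are hard-coded stand-ins for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text); speaker is "model" or "user"


@dataclass
class DebateResult:
    transcript: List[Turn]
    initially_correct: bool  # did the model start from the correct answer?
    final_answer: str
    defended: bool           # does the model still hold the correct answer?


def run_debate(
    question: str,
    correct_answer: str,
    wrong_solution: str,
    model_reply: Callable[[str, List[Turn]], str],      # hypothetical LLM wrapper
    user_reply: Callable[[str, str, List[Turn]], str],  # hypothetical simulated user
    extract_answer: Callable[[str], str],               # hypothetical answer parser
    max_turns: int = 3,                                  # assumed turn budget
) -> DebateResult:
    transcript: List[Turn] = []

    # The model first solves the question on its own.
    first = model_reply(question, transcript)
    transcript.append(("model", first))
    initially_correct = extract_answer(first) == correct_answer

    # The simulated user, conditioned on a wrong solution, pushes back.
    for _ in range(max_turns):
        critique = user_reply(question, wrong_solution, transcript)
        transcript.append(("user", critique))
        response = model_reply(question, transcript)
        transcript.append(("model", response))

    final_answer = extract_answer(transcript[-1][1])
    return DebateResult(
        transcript, initially_correct, final_answer, final_answer == correct_answer
    )


# Toy stand-ins for real LLM calls (assumptions for illustration only).
def stubborn_model(question: str, transcript: List[Turn]) -> str:
    return "The answer is 4."


def sycophantic_model(question: str, transcript: List[Turn]) -> str:
    # Gives in as soon as the user has pushed back once.
    pushed_back = any(speaker == "user" for speaker, _ in transcript)
    return "You are right, the answer is 5." if pushed_back else "The answer is 4."


def simulated_user(question: str, wrong_solution: str, transcript: List[Turn]) -> str:
    return f"I disagree. {wrong_solution}"


def extract_answer(text: str) -> str:
    # Takes the last whitespace-separated token, without the trailing period.
    return text.rstrip(".").split()[-1]
```

In this sketch, an example counts toward the failure rate only if `initially_correct` is true and `defended` is false, which mirrors the setting where the model starts from a correct solution and is then talked out of it.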