Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark

Zhouhong Gu, Zihan Li, Lin Zhang, Zhuozhi Xiong, Haoning Ye, Yikai Zhang, Wenhao Huang, Xiaoxuan Zhu, Qianyu He, Rui Xu, Sihang Jiang, Shusen Wang, Zili Wang, Hongwei Feng, Zhixu Li, Yanghua Xiao

Informal reasoning is the ability to reason from common sense, experience, and intuition. Humans use it every day to pick out, from a large amount of real-life information, the elements that matter most to their decisions. The rapid development of language models has raised hopes for general artificial intelligence. Yet, although humans are outstanding at informal reasoning, how much of this ability language models possess has not been well studied. To explore the gap between humans and language models in informal reasoning, this paper constructs the Detective Reasoning Benchmark, a collection of 1,200 questions gathered from accessible online resources that evaluates a model's informal reasoning ability in real-life contexts. Because the lack of a benchmark has also limited progress in improving models' informal reasoning, we further propose a Self-Question Prompt Framework that mimics human thinking to enhance this ability. Self-Question aims to have the model find the key elements, investigate in depth the connections between them, relate each element to the problem, and finally give a reasoned answer. Experimental results show that humans greatly outperform state-of-the-art (SoTA) language models on the Detective Reasoning Benchmark. Moreover, Self-Question proves to be the most effective prompt engineering method for improving GPT-4's informal reasoning ability, but it still does not surpass even the lowest score achieved by human participants. Upon acceptance of the paper, the source code for the benchmark will be made publicly accessible.
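The four Self-Question goals listed in the abstract can be read as a staged prompting loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the stage wording in `STAGES`, the function name `self_question`, and the `ask` callable (a stand-in for any language model call) are all hypothetical, since the abstract does not give the actual prompts.

```python
from typing import Callable, List

# Hypothetical stage instructions paraphrasing the four goals in the abstract;
# the paper's actual prompt wording is not given there.
STAGES: List[str] = [
    "List the key elements (clues) in the context.",
    "Explain how these key elements are connected to one another.",
    "Explain how each key element relates to the question.",
    "Using the analysis above, give a reasoned answer to the question.",
]


def self_question(context: str, question: str, ask: Callable[[str], str]) -> str:
    """Run the stages in order, feeding earlier replies into later prompts.

    `ask` is an assumed interface: it takes a prompt string and returns the
    model's reply as a string.
    """
    transcript = f"Context:\n{context}\n\nQuestion:\n{question}\n"
    reply = ""
    for number, instruction in enumerate(STAGES, start=1):
        prompt = f"{transcript}\nStep {number}: {instruction}"
        reply = ask(prompt)
        # Keep each instruction and reply so the next stage can build on it.
        transcript += f"\nStep {number}: {instruction}\n{reply}\n"
    # The reply to the final stage is treated as the answer.
    return reply
```

In this sketch each stage sees the replies to all earlier stages, so the final answer is conditioned on the extracted elements and their relations. Whether the original framework issues separate calls or a single combined prompt is not stated in the abstract.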
