Increases in computational scale and advances in fine-tuning have dramatically improved the quality of outputs from large language models (LLMs) like GPT. Given that both GPT-3 and GPT-4 were trained on large quantities of human-generated text, we might ask to what extent their outputs reflect patterns of human thinking, in both correct and incorrect cases. The Erotetic Theory of Reason (ETR) provides a symbolic generative model of both human success and failure in thinking, across propositional, quantified, and probabilistic reasoning, as well as decision-making. We presented GPT-3, GPT-3.5, and GPT-4 with 61 central inference and judgment problems from a recent book-length presentation of ETR (the ETR61 benchmark). The problems consist of experimentally verified data points on human judgment and extrapolated data points predicted by ETR, and they cover correct inference patterns as well as fallacies and framing effects. ETR61 includes classics like Wason's card task, illusory inferences, the decoy effect, and opportunity-cost neglect, among others. GPT-3 produced ETR-predicted outputs for 59% of these examples, rising to 77% for GPT-3.5 and 75% for GPT-4. Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4. This suggests that larger and more advanced LLMs may develop a tendency toward more human-like mistakes, since the relevant thought patterns are inherent in human-produced training data. According to ETR, the same fundamental patterns are involved in both successful and unsuccessful ordinary reasoning, so the "bad" cases could paradoxically be learned from the "good" cases. We further present preliminary evidence that ETR-inspired prompt engineering could reduce instances of these mistakes.