The impressive recent performance of large language models has led many to ask to what extent they can serve as models of general intelligence or resemble human cognition. We address this question by applying GPT-3 and GPT-4 to a classic problem in human inductive reasoning known as property induction. Over two experiments, we elicit human judgments on a range of property induction tasks spanning multiple domains. Although GPT-3 struggles to capture many aspects of human behaviour, GPT-4 is much more successful: for the most part, its performance qualitatively matches that of humans, the only notable exception being its failure to capture the phenomenon of premise non-monotonicity. Overall, this work not only demonstrates that property induction is an interesting skill on which to compare human and machine intelligence, but also provides two large datasets that can serve as suitable benchmarks for future work in this vein.