The impressive recent performance of large language models has led many to wonder to what extent they can serve as models of general intelligence or resemble human cognition. We address this issue by applying GPT-3.5 and GPT-4 to a classic problem in human inductive reasoning known as property induction. Over two experiments, we elicit human judgments on a range of property induction tasks spanning multiple domains. Although GPT-3.5 struggles to capture many aspects of human behaviour, GPT-4 is much more successful: for the most part, its performance qualitatively matches that of humans, the only notable exception being its failure to capture the phenomenon of premise non-monotonicity. Our work demonstrates that property induction allows for interesting comparisons between human and machine intelligence and provides two large datasets that can serve as benchmarks for future work in this vein.