The impressive recent performance of large language models has led many to wonder to what extent they can serve as models of general intelligence or resemble human cognition. We address this issue by applying GPT-3.5 and GPT-4 to a classic problem in human inductive reasoning known as property induction. Over two experiments, we elicit human judgments on a range of property induction tasks spanning multiple domains. Although GPT-3.5 struggles to capture many aspects of human behaviour, GPT-4 is much more successful: for the most part, its performance qualitatively matches that of humans, the only notable exception being its failure to capture the phenomenon of premise non-monotonicity. Our work demonstrates that property induction allows for interesting comparisons between human and machine intelligence and provides two large datasets that can serve as benchmarks for future work in this vein.