Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants: versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making.