How should we compare the capabilities of language models (LMs) and humans? I draw inspiration from comparative psychology to highlight some of the challenges. In particular, I consider a case study: the processing of recursively nested grammatical structures. Prior work suggests that LMs cannot handle these structures as reliably as humans can. However, the humans in that work were given instructions and training, while the LMs were evaluated zero-shot. I therefore match the evaluation conditions more closely. Providing large LMs with a simple prompt -- substantially less content than the human training -- allows the LMs to consistently outperform the human results, and even to extrapolate to more deeply nested conditions than were tested with humans. Further, a reanalysis of the prior human data suggests that the humans may not initially perform above chance on the difficult structures. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans do. This case study highlights how discrepancies in evaluation can confound comparisons of language models and humans. I therefore reflect on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and evaluating foundation models.