Reliably counting and generating sequences of items remains a significant challenge for neural networks, including Large Language Models (LLMs). Although this capability is readily handled by rule-based symbolic systems built on serial computation, neural models must acquire counting skills through learning, and learning to systematically deploy counting procedures has proven difficult for them. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs with sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emergence of counting strategies. We also evaluate open-source models of the same architecture but increasing size to test whether mastery of counting principles follows scaling laws, and we analyze embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engages in counting when simply asked to report the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.