Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.