For humans, language production and comprehension are sensitive to the hierarchical structure of sentences. In natural language processing, past work has questioned how effectively neural sequence models like transformers capture this hierarchical structure when generalizing to structurally novel inputs. We show that transformer language models can learn to generalize hierarchically after training for extremely long periods, far beyond the point when in-domain accuracy has saturated. We call this phenomenon \emph{structural grokking}. On multiple datasets, structural grokking exhibits inverted U-shaped scaling in model depth: intermediate-depth models generalize better than both very deep and very shallow transformers. When analyzing the relationship between model-internal properties and grokking, we find that the optimal depth for grokking can be identified using the tree-structuredness metric of \citet{murty2023projections}. Overall, our work provides strong evidence that, with extended training, vanilla transformers discover and use hierarchical structure.