We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large that threshold must be. In this work, we provide the first quantitative bounds on the training length required for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.
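The claim that finite-precision attention reduces to an argmax can be illustrated with a minimal numerical sketch (not taken from the paper; the scale factor and rounding precision are illustrative assumptions). When attention logits are sharp, the softmax weights round to a one-hot vector at finite precision, i.e. the attention selects a single position, exactly as a hard argmax would:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical attention logits for one query over four positions.
scores = [4.0, 1.0, 0.5, -2.0]

# Sharpen the logits (e.g. a large inverse-temperature of 10.0, an
# illustrative choice), then model finite precision by rounding each
# weight to two decimal places.
sharp = softmax([10.0 * s for s in scores])
quantized = [round(w, 2) for w in sharp]

# At this precision the weights collapse to a one-hot vector at the
# argmax position: attention has become a hard selection.
print(quantized)
print(scores.index(max(scores)))
```

Under these assumptions the rounded weights are `[1.0, 0.0, 0.0, 0.0]`, a one-hot at index 0, matching the argmax of the original logits.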