Length generalization (the ability to generalize to sequences longer than those seen during training) and compositional generalization (the ability to generalize to token combinations not seen during training) are crucial forms of out-of-distribution generalization in sequence-to-sequence models. In this work, we take the first steps towards provable length and compositional generalization for a range of architectures, including deep sets, transformers, state space models, and simple recurrent neural nets. Depending on the architecture, we prove that different degrees of representation identification, e.g., a linear or a permutation relation with the ground-truth representation, are necessary for length and compositional generalization.