Length generalization (the ability to generalize to sequences longer than those seen during training) and compositional generalization (the ability to generalize to token combinations not seen during training) are crucial forms of out-of-distribution generalization in sequence-to-sequence models. In this work, we take the first steps towards provable length and compositional generalization for a range of architectures, including deep sets, transformers, state space models, and simple recurrent neural nets. Depending on the architecture, we prove that different degrees of representation identification, e.g., a linear or a permutation relation with the ground-truth representation, are necessary for length and compositional generalization.