The out-of-distribution generalization capabilities of sequence-to-sequence models can be studied through the lens of two crucial forms of generalization: length generalization, the ability to generalize to sequences longer than those seen during training, and compositional generalization, the ability to generalize to token combinations not seen during training. In this work, we provide the first provable guarantees on length and compositional generalization for common sequence-to-sequence models (deep sets, transformers, state space models, and recurrent neural nets) trained to minimize the prediction error. Taking a first-principles perspective, we study the realizable case, i.e., the case in which the labeling function is realizable on the architecture. We show that limited-capacity versions of these different architectures achieve both length and compositional generalization. Across different architectures, we also find that a linear relationship between the learned representation and the representation in the labeling function is necessary for length and compositional generalization.