Out-of-distribution generalization in sequence-to-sequence models can be studied through the lens of two crucial forms of generalization: length generalization -- the ability to generalize to sequences longer than those seen during training -- and compositional generalization -- the ability to generalize to token combinations not seen during training. In this work, we provide the first provable guarantees on length and compositional generalization for common sequence-to-sequence architectures -- deep sets, transformers, state space models, and recurrent neural nets -- trained to minimize the prediction error. We show that limited-capacity versions of these architectures achieve both length and compositional generalization provided the training distribution is sufficiently diverse. In the first part, we study structured limited-capacity variants of the different architectures and arrive at generalization guarantees under limited diversity requirements on the training distribution. In the second part, we study limited-capacity variants with weaker structural assumptions and again arrive at generalization guarantees, but under stronger diversity requirements on the training distribution.