Out-of-distribution generalization capabilities of sequence-to-sequence models can be studied through the lens of two crucial forms of generalization: length generalization -- the ability to generalize to sequences longer than those seen during training -- and compositional generalization -- the ability to generalize to token combinations not seen during training. In this work, we provide the first provable guarantees on length and compositional generalization for common sequence-to-sequence models -- deep sets, transformers, state space models, and recurrent neural networks -- trained to minimize prediction error. Taking a first-principles perspective, we study the realizable case, i.e., the case in which the labeling function is realizable on the architecture. We show that \emph{simple limited-capacity} versions of these different architectures achieve both length and compositional generalization. Across all our results on the different architectures, we find that the learned representations are linearly related to the representations generated by the true labeling function.