This paper considers the learning of logical (Boolean) functions, with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that, for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree interpolator is learned on the unseen. We also provide evidence that other instances, with larger learning rates or mean-field networks, reach leaky min-degree solutions. These findings lead to two implications: (1) we provide an explanation for the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.
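As an illustration of the min-degree interpolator notion mentioned above, the following is a minimal sketch and not the paper's method (the paper studies what (S)GD-trained networks converge to). It assumes that "min-degree interpolator" means a function that fits the seen data while placing as little weight as possible on high-degree monomials, and it approximates this with a degree-penalized minimum-norm fit. The function names and the `penalty` knob are hypothetical. The toy target is f(x) = x_0 x_1 (0-based indices), with all samples having x_0 = -1 held out as unseen.

```python
import itertools
import numpy as np

def monomials(n):
    """All subsets of {0..n-1}, ordered by degree (size)."""
    return [S for d in range(n + 1) for S in itertools.combinations(range(n), d)]

def chi(S, x):
    """Monomial chi_S(x) = prod_{i in S} x_i for x in {-1,+1}^n."""
    return float(np.prod([x[i] for i in S])) if S else 1.0

def approx_min_degree_interpolator(f, seen, n, penalty=1e3):
    """Approximate sketch: interpolate f on `seen` with monomial coefficients a
    that minimize sum_S penalty^(2|S|) * a_S^2, so higher degrees cost more.
    `penalty` is an illustrative assumption, not a quantity from the paper."""
    basis = monomials(n)
    M = np.array([[chi(S, x) for S in basis] for x in seen])
    y = np.array([f(x) for x in seen])
    w = np.array([penalty ** len(S) for S in basis])
    # Substitute b = w * a: min ||b||_2 subject to (M / w) b = y.
    b = np.linalg.pinv(M / w) @ y
    return dict(zip(basis, b / w))

# Toy GOTU setup: target x_0 * x_1, training only on samples with x_0 = +1.
n = 2
target = lambda x: x[0] * x[1]
seen = [x for x in itertools.product([-1, 1], repeat=n) if x[0] == 1]
coeffs = approx_min_degree_interpolator(target, seen, n)
for S, a in coeffs.items():
    print(S, round(a, 3))
```

On the seen half of the cube the target coincides with x_1, so the degree-penalized fit puts essentially all of its weight on the degree-1 monomial x_1 rather than on the true degree-2 monomial x_0 x_1. On the unseen half (x_0 = -1) this low-degree solution disagrees with the target, which is the kind of behaviour the GOTU setting is meant to expose.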