We develop a Mean-Field (MF) view of the learning dynamics of overparametrized Artificial Neural Networks (NNs) under data symmetric in law with respect to the action of a general compact group $G$. To this end, we consider a class of generalized shallow NNs given by an ensemble of $N$ multi-layer units, jointly trained using stochastic gradient descent (SGD) and possibly symmetry-leveraging (SL) techniques, such as Data Augmentation (DA), Feature Averaging (FA) or Equivariant Architectures (EA). We introduce the notions of weakly and strongly invariant laws (WI and SI) on the parameter space of each single unit, corresponding, respectively, to $G$-invariant distributions and to distributions supported on parameters fixed by the group action (which encode EA). This allows us to define symmetric models compatible with taking $N\to\infty$ and to interpret the asymptotic dynamics of DA, FA and EA in terms of Wasserstein Gradient Flows describing their MF limits. When activations respect the group action, we show that, for symmetric data, DA, FA and freely trained models obey exactly the same MF dynamics, which remain in the space of WI laws and minimize the population risk therein. We also give a counterexample showing that an optimum over SI laws is not attainable in general. Despite this, and quite remarkably, we show that the set of SI laws is also preserved by the MF dynamics, even under free training. This sharply contrasts with the finite-$N$ setting, in which EAs are generally not preserved by unconstrained SGD. We illustrate the validity of our findings as $N$ grows in a teacher-student experimental setting, training a student NN to learn from a WI, SI or arbitrary teacher model through various SL schemes. Finally, we deduce a data-driven heuristic for discovering the largest subspace of parameters supporting SI distributions for a given problem, which could be used to design EAs with minimal generalization error.
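For concreteness, a minimal formalization consistent with these definitions (the unit map $\phi$, the parameter law $\mu$ and the induced action $g\cdot\theta$ are illustrative notation, not necessarily the paper's): the ensemble predictor and its MF limit read
$$f_\Theta(x)\;=\;\frac{1}{N}\sum_{i=1}^{N}\phi(x;\theta_i),\qquad f_\mu(x)\;=\;\int \phi(x;\theta)\,\mu(\mathrm{d}\theta),\qquad \mu\;=\;\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i},$$
and a law $\mu$ on the unit parameter space is WI when $(g\cdot)_{\#}\,\mu=\mu$ for every $g\in G$, while it is SI when $\mu\big(\{\theta:\,g\cdot\theta=\theta\ \text{for all}\ g\in G\}\big)=1$, i.e. when $\mu$ is supported on the fixed-point set of the action.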