We develop a Mean-Field (MF) view of the learning dynamics of overparametrized Artificial Neural Networks (NNs) under data symmetric in law with respect to the action of a general compact group $G$. To this end, we consider a class of generalized shallow NNs given by an ensemble of $N$ multi-layer units, jointly trained using stochastic gradient descent (SGD) and possibly symmetry-leveraging (SL) techniques, such as Data Augmentation (DA), Feature Averaging (FA) or Equivariant Architectures (EA). We introduce the notions of weakly and strongly invariant laws (WI and SI) on the parameter space of each single unit, corresponding, respectively, to $G$-invariant distributions and to distributions supported on parameters fixed by the group action (which encode EA). This allows us to define symmetric models compatible with taking $N\to\infty$ and to interpret the asymptotic dynamics of DA, FA and EA in terms of Wasserstein Gradient Flows describing their MF limits. When activations respect the group action, we show that, for symmetric data, DA, FA and freely trained models obey exactly the same MF dynamics, which remain in the space of WI laws and minimize the population risk therein. We also give a counterexample showing that an optimum over SI laws is not attainable in general. Despite this, and quite remarkably, we show that the set of SI laws is also preserved by the MF dynamics, even under free training. This sharply contrasts with the finite-$N$ setting, in which EAs are generally not preserved by unconstrained SGD. We illustrate the validity of our findings as $N$ grows in a teacher-student experimental setting, training a student NN to learn from a WI, SI or arbitrary teacher model through various SL schemes. Finally, we deduce a data-driven heuristic for discovering the largest subspace of parameters supporting SI distributions for a given problem, which could be used to design EAs with minimal generalization error.
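For concreteness, a minimal formalization consistent with these definitions (the unit map $\phi$, the parameter law $\mu$ and the induced action $g\cdot\theta$ are illustrative notation, not necessarily the paper's): the ensemble predictor and its MF limit read
$$f_\Theta(x)\;=\;\frac{1}{N}\sum_{i=1}^{N}\phi(x;\theta_i),\qquad f_\mu(x)\;=\;\int \phi(x;\theta)\,\mu(\mathrm{d}\theta),\qquad \mu\;=\;\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i},$$
and a law $\mu$ on the unit parameter space is WI when $(g\cdot)_{\#}\,\mu=\mu$ for every $g\in G$, while it is SI when $\mu\big(\{\theta:\,g\cdot\theta=\theta\ \text{for all}\ g\in G\}\big)=1$, i.e. when $\mu$ is supported on the fixed-point set of the action.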