We consider the optimization problem associated with fitting two-layer ReLU networks with $d$~inputs and $k$~neurons, where labels are generated by a target network. Two types of infinite families of spurious minima, one minimum per $d$, were recently found. For the first type, the loss at the minima converges to zero as $d$ increases; for the second, the loss remains bounded away from zero. How, then, may one avoid minima of the latter type? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of these hidden minima. By existing analyses, the Hessian spectra of the two types agree modulo $O(d^{-1/2})$ terms, which is not promising. Our investigation therefore proceeds instead by studying curves along which the loss is minimized or maximized, generally referred to as tangency arcs. We prove that seemingly far-removed considerations from group representation theory, concerning the arrangement of subspaces invariant under the action of subgroups of $S_d$, the symmetry group on $d$ symbols, relative to subspaces fixed by the action, yield a precise description of all finitely many admissible types of tangency arcs. Applied to the loss function, these general results reveal that arcs emanating from hidden minima differ characteristically in their structure and symmetry, precisely on account of the $O(d^{-1/2})$ eigenvalue terms absent in previous work, indicating in particular the subtlety of the analysis. The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically sufficiently tame to enable a numerical construction of tangency arcs, and hence a comparison of how minima of both types are positioned relative to adjacent critical points.
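The numerical construction of tangency arcs alluded to above can be illustrated in a toy two-dimensional setting. The sketch below is purely illustrative and not the paper's method: the loss `f`, its coefficients, and the function names are hypothetical, and the arcs are traced naively by taking, for each radius $r$ around a critical point, the point on the circle of radius $r$ at which `f` is minimized (resp. maximized), so that the resulting locus approximates a curve along which the loss is extremal.

```python
import numpy as np

def f(x, y):
    # Hypothetical toy loss with a strict local minimum at the origin;
    # the cubic term breaks the symmetry between the two axes slightly.
    return x**2 + 2 * y**2 + 0.5 * x * y**2

def tangency_arc(radii, n_angles=2000, maximize=False):
    """For each radius r, locate the point on the circle of radius r
    (centred at the critical point) where f is extremal; the locus of
    these points numerically sketches a tangency arc."""
    thetas = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    arc = []
    for r in radii:
        vals = f(r * np.cos(thetas), r * np.sin(thetas))
        idx = np.argmax(vals) if maximize else np.argmin(vals)
        arc.append((r * np.cos(thetas[idx]), r * np.sin(thetas[idx])))
    return np.array(arc)

radii = np.linspace(1e-3, 0.5, 50)
arc_min = tangency_arc(radii)                  # direction of slowest growth
arc_max = tangency_arc(radii, maximize=True)   # direction of fastest growth
```

Near the critical point the two arcs align, to leading order, with the eigenvectors of the Hessian of `f`, while higher-order terms bend them; this mirrors, in miniature, how lower-order eigenvalue corrections shape the structure of the arcs in the analysis above.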