Hidden Minima in Two-Layer ReLU Networks

We consider the optimization problem associated with fitting two-layer ReLU networks with $d$~inputs, $k$~neurons, and labels generated by a target network. Two categories of infinite families of minima, each giving one minimum per $d$ and $k$, were recently found. The loss at minima belonging to the first category converges to zero as $d$ increases. In the second category, the loss remains bounded away from zero. How, then, may one avoid minima belonging to the latter category? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of these hidden minima. By existing analyses, the Hessian spectra of the two categories agree modulo $O(d^{-1/2})$-terms, which makes the spectrum alone an unpromising tool for telling them apart. Our investigation therefore proceeds by studying curves along which the loss is minimized or maximized, referred to as tangency arcs. We prove that purely group representation-theoretic considerations, seemingly remote from the problem, yield a precise description of all finitely many admissible types of tangency arcs. These considerations concern the arrangement of subspaces invariant under the action of subgroups of $S_d$, the symmetric group on $d$ symbols, relative to the subspaces fixed by the action. Applied to the loss function, the general results reveal that arcs emanating from hidden minima differ characteristically in their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent in previous work, which indicates the subtlety of the analysis. The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically sufficiently tame, permitting a numerical construction of tangency arcs and, ultimately, a comparison of how minima from both categories are positioned relative to adjacent critical points.
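
To make the setting concrete, the following is a minimal sketch of a student-teacher loss for two-layer ReLU networks. The abstract does not fix the details, so everything specific here is an assumption made for illustration: standard Gaussian inputs, a squared loss, unit second-layer weights, $d = k$, an identity target weight matrix, and a Monte Carlo estimate in place of the exact expectation. The names `two_layer` and `mc_loss` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer(W, X):
    """Sum of ReLU neurons, f(x; W) = sum_i relu(<w_i, x>).

    W has shape (k, d); X has shape (n, d). Returns an array of shape (n,).
    """
    return relu(X @ W.T).sum(axis=1)

def mc_loss(W, V, X):
    """Monte Carlo estimate of 0.5 * E[(f(x; W) - f(x; V))^2] over the rows of X."""
    residual = two_layer(W, X) - two_layer(V, X)
    return 0.5 * np.mean(residual ** 2)

rng = np.random.default_rng(0)
d = k = 6                              # assumption: as many neurons as inputs
V = np.eye(k, d)                       # assumption: identity target weights
X = rng.standard_normal((20000, d))    # assumption: standard Gaussian inputs

# A student obtained by perturbing the target weights.
W = V + 0.1 * rng.standard_normal((k, d))

loss_at_target = mc_loss(V, V, X)      # the target itself fits the labels exactly
loss_at_perturbed = mc_loss(W, V, X)

# Reordering the neurons (rows of W) leaves the network function unchanged,
# a simple instance of the permutation symmetry the analysis relies on.
loss_at_permuted = mc_loss(W[rng.permutation(k)], V, X)
```

Under these assumptions, `loss_at_target` is zero, and `loss_at_perturbed` agrees with `loss_at_permuted` up to floating-point rounding. The sketch only illustrates the objective and one of its symmetries; it does not construct tangency arcs or locate hidden minima.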
