We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We focus on settings where the data, $(Y_\ell)_{\ell =1}^n \in \mathbb{R}^d$, are i.i.d. draws from a $k$-component Gaussian mixture model, and the loss depends on the data only through its projection onto a fixed number of directions, namely $\mathbf{x}^\top Y$, where $\mathbf{x}\in \mathbb{R}^{d\times C}$ are the parameters and $C$ need not equal $k$. This setting captures a broad class of problems, such as classification by one- and two-layer networks and regression on multi-index models. We provide exact formulas for the limits of the empirical spectral distribution and of the outlier eigenvalues and eigenvectors of such matrices in the proportional asymptotics limit, where the number of samples and the dimension grow together, $n,d\to\infty$ with $n/d=\phi\in (0,\infty)$. These limits depend on the parameters $\mathbf{x}$ only through the summary statistic $\mathbf{G} = (\mathbf{x},\boldsymbol{\mu})^\top(\mathbf{x},\boldsymbol{\mu})$, the $(C+k)\times (C+k)$ Gram matrix of the parameters and class means. It is known that, under general conditions, when $\mathbf{x}$ is trained by online stochastic gradient descent, the evolution of these same summary statistics along training converges to the solution of an autonomous system of ODEs, called the effective dynamics. This enables us to connect the training dynamics to the spectral theory of these matrices generated with test data. We demonstrate our general results by analyzing the effective spectrum along the effective dynamics in the case of multi-class logistic regression. In this setting, the empirical Hessian and information matrices have substantially different spectra, each exhibiting its own static and even dynamical spectral transitions.
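The objects above can be made concrete in the multi-class logistic regression setting. The following is a minimal numpy sketch, not the paper's method: it assumes identity-covariance mixture components, a specific $1/\sqrt{d}$ scaling of means and parameters, and random (untrained) $\mathbf{x}$, and it builds the empirical Hessian $\frac1n\sum_\ell (\mathrm{diag}(p_\ell)-p_\ell p_\ell^\top)\otimes Y_\ell Y_\ell^\top$, the empirical information matrix $\frac1n\sum_\ell g_\ell g_\ell^\top$ with $g_\ell = \mathrm{vec}(Y_\ell (p_\ell - e_{c_\ell})^\top)$, and the Gram summary statistic $\mathbf{G}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 100, 2          # samples, dimension, mixture components (illustrative sizes)
C = k                            # number of classes; C need not equal k in general
phi = n / d                      # proportional-asymptotics ratio n/d

# k-Gaussian mixture with identity covariance (an assumption of this sketch)
mu = rng.normal(size=(d, k)) / np.sqrt(d)       # class means as columns
labels = rng.integers(k, size=n)
Y = mu[:, labels].T + rng.normal(size=(n, d))   # (n, d) data matrix

x = rng.normal(size=(d, C)) / np.sqrt(d)        # parameters (random, i.e. untrained)

# the loss sees the data only through the projections x^T Y
Z = Y @ x                                       # (n, C) logits
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)               # softmax probabilities

H = np.zeros((d * C, d * C))        # empirical Hessian of the logistic loss
I_mat = np.zeros((d * C, d * C))    # empirical information (Gram of per-sample gradients)
eye_C = np.eye(C)
for l in range(n):
    p, y = P[l], Y[l]
    A = np.diag(p) - np.outer(p, p)             # C x C softmax Hessian factor
    H += np.kron(A, np.outer(y, y)) / n         # column-major vec(x) convention
    g = np.kron(p - eye_C[labels[l]], y)        # vec of the gradient y (p - e_c)^T
    I_mat += np.outer(g, g) / n

# summary statistic: Gram matrix of parameters and class means
G = np.hstack([x, mu]).T @ np.hstack([x, mu])   # (C+k, C+k)

ev_H = np.linalg.eigvalsh(H)
ev_I = np.linalg.eigvalsh(I_mat)
```

Both matrices are positive semidefinite here (the logistic loss is convex in $\mathbf{x}$), yet their spectra differ; the paper's results describe the $n,d\to\infty$ limits of such spectra as functions of $\mathbf{G}$ alone.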