Let $(x_{i}, y_{i})_{i=1,\dots,n}$ denote independent samples from a general mixture distribution $\sum_{c\in\mathcal{C}}\rho_{c}P_{c}^{x}$, and consider the hypothesis class of generalized linear models $\hat{y} = F(\Theta^{\top}x)$. In this work, we investigate the asymptotic joint statistics of the family of generalized linear estimators $(\Theta_{1}, \dots, \Theta_{M})$ obtained either by (a) minimizing an empirical risk $\hat{R}_{n}(\Theta;X,y)$ or (b) sampling from the associated Gibbs measure $\exp(-\beta n \hat{R}_{n}(\Theta;X,y))$. Our main contribution is to characterize the conditions under which the asymptotic joint statistics of this family depend (in a weak sense) only on the means and covariances of the class-conditional feature distributions $P_{c}^{x}$. In particular, this allows us to prove the universality of several quantities of interest, such as the training and generalization errors, redeeming a recent line of work in high-dimensional statistics that relies on the Gaussian mixture hypothesis. Finally, we discuss applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty estimation.
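The universality claim can be illustrated numerically with a minimal sketch, which is not the construction used in this work. Everything specific here is an assumption made for illustration: two classes with means $\pm\mu$ and identity covariance, Rademacher noise as the non-Gaussian class-conditional distribution, a ridge-regularized square loss as the empirical risk, and the chosen values of $n$, $d$ and the regularization strength. The sketch fits the same estimator on the non-Gaussian mixture and on a Gaussian mixture with matched means and covariances, then compares training and test errors.

```python
import numpy as np


def sample_mixture(n, mu, rng, gaussian):
    """Draw n points from a two-class mixture with means +/- mu and identity covariance."""
    y = rng.choice([-1.0, 1.0], size=n)  # equal class weights (assumption)
    if gaussian:
        z = rng.standard_normal((n, mu.size))
    else:
        # Rademacher noise: zero mean and unit variance, but not Gaussian
        z = rng.choice([-1.0, 1.0], size=(n, mu.size))
    return y[:, None] * mu + z, y


def ridge_erm(X, y, lam):
    """Closed-form minimizer of (1/n)||y - X theta||^2 + lam ||theta||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)


def errors(gaussian, n=400, d=200, n_test=4000, lam=0.1, seed=0):
    """Return (training mean squared error, test classification error)."""
    rng = np.random.default_rng(seed)
    mu = np.full(d, 1.5 / np.sqrt(d))  # delocalized mean with norm 1.5 (assumption)
    X, y = sample_mixture(n, mu, rng, gaussian)
    X_test, y_test = sample_mixture(n_test, mu, rng, gaussian)
    theta = ridge_erm(X, y, lam)
    train_mse = np.mean((y - X @ theta) ** 2)
    test_err = np.mean(np.sign(X_test @ theta) != y_test)
    return train_mse, test_err


if __name__ == "__main__":
    for name, flag in [("Gaussian mixture", True), ("Rademacher mixture", False)]:
        train_mse, test_err = errors(gaussian=flag)
        print(f"{name}: train MSE = {train_mse:.3f}, test error = {test_err:.3f}")
```

At these finite sizes the two sets of numbers should agree only up to sampling fluctuations; the statement in the abstract concerns the asymptotic regime.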