In this manuscript we consider the problem of generalized linear estimation on Gaussian mixture data with labels given by a single-index model. Our first result is a sharp asymptotic expression for the test and training errors in the high-dimensional regime. Motivated by the recent stream of results on the Gaussian universality of the test and training errors in generalized linear estimation, we ask: "when is a single Gaussian enough to characterize the error?". Our formula allows us to give sharp answers to this question, in both the positive and negative directions. More precisely, we show that the sufficient conditions for Gaussian universality (or lack thereof) crucially depend on the alignment between the target weights and the means and covariances of the mixture clusters, which we precisely quantify. In the particular case of least-squares interpolation, we prove a strong universality property of the training error, and show that it follows a simple, closed-form expression. Finally, we apply our results to real datasets, clarifying some recent discussion in the literature about Gaussian universality of the errors in this context.
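To make the data model concrete, the following is a minimal sketch of how Gaussian mixture inputs with single-index labels could be simulated. It is an illustration under stated assumptions, not the setup used in the manuscript: the names `sample_mixture` and `single_index_labels`, the isotropic cluster covariances, the sign link function, the 1/sqrt(d) scaling, and all numerical values are hypothetical choices made here for simplicity.

```python
import math
import random


def sample_mixture(n, means, stds, weights, rng):
    """Draw n points from a Gaussian mixture.

    means:   list of cluster mean vectors (one per cluster)
    stds:    per-cluster standard deviation; each cluster covariance is
             std**2 * I (isotropic covariances are a simplifying assumption)
    weights: cluster probabilities
    Returns the sampled points and the cluster index of each point.
    """
    xs, clusters = [], []
    for _ in range(n):
        # Pick a cluster, then draw each coordinate around that cluster's mean.
        c = rng.choices(range(len(means)), weights=weights)[0]
        xs.append([rng.gauss(m, stds[c]) for m in means[c]])
        clusters.append(c)
    return xs, clusters


def single_index_labels(xs, theta, link):
    """Single-index labels: y = link(<theta, x> / sqrt(d)).

    The label depends on x only through one projection onto the
    target weights theta. The 1/sqrt(d) scaling is an assumed convention.
    """
    d = len(theta)
    return [
        link(sum(t * xi for t, xi in zip(theta, x)) / math.sqrt(d))
        for x in xs
    ]


def sign(z):
    """Hypothetical link function giving binary labels in {-1, +1}."""
    return 1.0 if z >= 0 else -1.0


# Illustrative values only: two clusters with opposite means in d = 20.
rng = random.Random(0)
d = 20
means = [[1.0] * d, [-1.0] * d]
theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
xs, clusters = sample_mixture(200, means, [1.0, 0.5], [0.5, 0.5], rng)
ys = single_index_labels(xs, theta, sign)
```

In this sketch, changing `theta` relative to the cluster means (for example, making it orthogonal to their difference) changes how strongly the labels depend on cluster membership, which is the kind of alignment the abstract refers to.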