While classical in many theoretical settings, particularly in works inspired by statistical physics, the assumption of i.i.d. Gaussian input data is often perceived as a strong limitation in statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, a.k.a. the perceptron model, with random labels. We argue that there is a large universality class of high-dimensional input data for which the minimum training loss is the same as for Gaussian data with matching covariance. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. On the theoretical side, we prove this universality for an arbitrary mixture of homogeneous Gaussian clouds. Empirically, we show that the universality also holds for a broad range of real datasets.
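The universality claim can be probed numerically on a toy problem. The sketch below is only an illustration under stated assumptions, not the procedure used in the study: it takes a ridge-regularized square loss (chosen because its minimizer has a closed form), random ±1 labels, and compares i.i.d. Gaussian inputs with i.i.d. Rademacher inputs, which share the identity covariance. The function names, the problem sizes `n` and `d`, and the regularization strength `lam` are all hypothetical choices made for this example.

```python
import numpy as np


def min_ridge_training_loss(X, y, lam):
    """Minimum over w of (1/2n)||y - Xw||^2 + (lam/2)||w||^2.

    Assumption: square loss with ridge penalty, used here only because
    the minimizer is available in closed form.
    """
    n, d = X.shape
    # Normal equations: (X^T X / n + lam * I) w = X^T y / n
    A = X.T @ X / n + lam * np.eye(d)
    w = np.linalg.solve(A, X.T @ y / n)
    residual = y - X @ w
    return 0.5 * np.mean(residual ** 2) + 0.5 * lam * np.dot(w, w)


def compare_gaussian_vs_rademacher(n=2000, d=1000, lam=1e-2, seed=0):
    """Return the minimum training loss for Gaussian and Rademacher inputs.

    Both input ensembles have identity covariance; the labels are random
    signs drawn independently of the inputs.
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)            # random labels
    X_gauss = rng.standard_normal((n, d))          # Gaussian i.i.d. inputs
    X_rad = rng.choice([-1.0, 1.0], size=(n, d))   # non-Gaussian, same covariance
    return (
        min_ridge_training_loss(X_gauss, y, lam),
        min_ridge_training_loss(X_rad, y, lam),
    )


if __name__ == "__main__":
    loss_gauss, loss_rad = compare_gaussian_vs_rademacher()
    print(f"Gaussian inputs:   {loss_gauss:.4f}")
    print(f"Rademacher inputs: {loss_rad:.4f}")
```

If universality holds in this setting, the two printed values should be close to each other and should move closer as `n` and `d` grow at a fixed ratio. Replacing `X_rad` with features from a real dataset, and `X_gauss` with Gaussian samples of matching covariance, would give a rough analogue of the empirical comparison described above.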