The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets

We study universal traits that emerge both in real-world complex datasets and in artificially generated ones. Our approach is to treat data as analogous to a physical system and to employ tools from statistical physics and Random Matrix Theory (RMT) to reveal its underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) the power-law scaling exhibited by the bulk of its eigenvalues differs vastly between uncorrelated normally distributed data and real-world data; (ii) this scaling behavior can be completely modeled by generating Gaussian data with long-range correlations; (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems; (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior; (v) the Shannon entropy is correlated with the local RMT structure and the eigenvalue scaling, is substantially smaller in strongly correlated datasets than in uncorrelated synthetic data, and requires fewer samples to reach the distribution entropy. These findings show that, with sufficient sample size, the Gram matrix of natural image datasets can be well approximated by a Wishart random matrix with a simple covariance structure, opening the door to rigorous studies of neural network dynamics and generalization that rely on the data Gram matrix.
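
The two kinds of eigenvalue statistics mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the following are assumptions made only for illustration: the correlated data are Gaussian with a diagonal population covariance whose eigenvalues decay as i^-(1+alpha); the global statistic is a least-squares slope in log-log coordinates; and the local statistic is the mean ratio of consecutive level spacings, whose standard reference values are roughly 0.53 for the Gaussian Orthogonal Ensemble (chaotic) and roughly 0.39 for Poisson statistics (integrable). All function names and parameter values are hypothetical.

```python
import numpy as np


def sample_gaussian(n_samples, spectrum, rng):
    """Draw zero-mean Gaussian rows with population covariance diag(spectrum)."""
    z = rng.standard_normal((n_samples, spectrum.size))
    return z * np.sqrt(spectrum)


def covariance_eigenvalues(x):
    """Eigenvalues of the empirical feature-feature covariance, largest first."""
    cov = x.T @ x / x.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]


def bulk_exponent(eigs, lo=1, hi=None):
    """Slope of log(eigenvalue) against log(rank) over ranks lo..hi (1-indexed)."""
    hi = eigs.size // 2 if hi is None else hi
    ranks = np.arange(lo, hi + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eigs[lo - 1:hi]), 1)
    return slope


def mean_spacing_ratio(levels):
    """Mean of min(s_n, s_{n+1}) / max(s_n, s_{n+1}) over consecutive spacings s_n."""
    s = np.diff(np.sort(levels))
    return np.mean(np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:]))


rng = np.random.default_rng(0)
d, m, alpha = 100, 20_000, 0.5  # illustrative sizes, not taken from the paper

# Global statistics: bulk scaling of correlated vs. uncorrelated Gaussian data.
ranks = np.arange(1, d + 1, dtype=float)
correlated = sample_gaussian(m, ranks ** (-(1.0 + alpha)), rng)
uncorrelated = sample_gaussian(m, np.ones(d), rng)
print("correlated slope:  ", bulk_exponent(covariance_eigenvalues(correlated)))
print("uncorrelated slope:", bulk_exponent(covariance_eigenvalues(uncorrelated)))

# Local statistics: spacing ratio of a Wishart spectrum vs. independent levels.
wishart_r = np.mean([
    mean_spacing_ratio(covariance_eigenvalues(sample_gaussian(400, np.ones(200), rng)))
    for _ in range(10)
])
poisson_r = mean_spacing_ratio(rng.uniform(size=2000))
print("Wishart <r>:", wishart_r, " Poisson <r>:", poisson_r)
```

The spacing ratio is used here instead of an unfolded level-spacing distribution because it does not require estimating the local eigenvalue density; this is a convenience choice for the sketch, not a statement about the analysis used in the paper.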
