We study universal traits that emerge in both real-world complex datasets and artificially generated ones. Our approach is to treat data as analogous to a physical system and to employ tools from statistical physics and Random Matrix Theory (RMT) to reveal its underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) the power-law scaling exhibited by the bulk of the covariance eigenvalues differs vastly between uncorrelated random data and real-world data; (ii) this scaling behavior can be completely recovered by introducing long-range correlations into the synthetic data in a simple way; (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems; (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior; (v) the Shannon entropy is correlated with the local RMT structure and the eigenvalue scaling, is substantially smaller in strongly correlated datasets than in uncorrelated synthetic data, and requires fewer samples to reach the distribution entropy. These findings can have numerous implications for characterizing the complexity of datasets, including differentiating synthetically generated from natural data, quantifying noise, developing better data pruning methods, and classifying effective learning models utilizing these scaling laws.
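As a rough illustration of the quantities discussed above, the following sketch builds the empirical feature-feature covariance matrix for synthetic Gaussian data, fits a power-law exponent to the leading eigenvalues, and computes the mean consecutive-spacing ratio, one common diagnostic of local RMT statistics (standard reference values are about 0.5307 for the Gaussian Orthogonal Ensemble and about 0.386 for Poisson, i.e. integrable-like, spectra). Everything here is an assumption made for illustration: the function names, the choice of population eigenvalues i^{-(1+alpha)} in a random orthogonal basis as the way to introduce correlations, the fitting window, and the sample sizes are not taken from the text and may differ from the construction actually used.

```python
import numpy as np

def sample_gaussian_data(n, d, alpha=None, seed=0):
    """Draw n samples with d features (hypothetical helper).

    alpha=None gives uncorrelated features. Otherwise the population
    covariance has eigenvalues i**-(1+alpha) in a random orthogonal basis,
    an assumed stand-in for "long-range correlations".
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    if alpha is None:
        return z
    lam = np.arange(1, d + 1, dtype=float) ** -(1.0 + alpha)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
    return (z * np.sqrt(lam)) @ q.T

def covariance_spectrum(x):
    """Eigenvalues of the empirical feature-feature covariance, descending."""
    x = x - x.mean(axis=0)
    cov = x.T @ x / x.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]

def fit_power_law_exponent(eigs, lo=5, hi=50):
    """Least-squares slope of log(eigenvalue) vs log(rank) over ranks lo..hi.

    Returns the positive exponent p in lambda_i ~ i**-p.
    """
    idx = np.arange(lo, hi + 1)
    slope, _ = np.polyfit(np.log(idx), np.log(eigs[lo - 1:hi]), 1)
    return -slope

def mean_spacing_ratio(eigs):
    """Mean of min(s_i, s_{i+1}) / max(s_i, s_{i+1}) over consecutive gaps."""
    s = np.diff(np.sort(eigs))
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()

# Compare uncorrelated and correlated synthetic data (sizes are arbitrary).
for name, alpha in [("uncorrelated", None), ("correlated", 0.5)]:
    eigs = covariance_spectrum(sample_gaussian_data(4000, 200, alpha))
    print(name, fit_power_law_exponent(eigs), mean_spacing_ratio(eigs))
```

The spacing ratio is used here because it needs no unfolding of the spectrum, which keeps the sketch short. Varying `n` at fixed `d` gives a simple way to see how many samples are needed before the fitted exponent and the spacing statistics stabilize.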