Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p\ge 4$ cumulants of~$d$-dimensional inputs. We first characterise the fundamental statistical and computational limits of recovering the spike by analysing the number of samples~$n$ required to strongly distinguish between inputs from the spiked cumulant model and isotropic Gaussian inputs. We find that statistical distinguishability requires $n\gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, namely those covered by the low-degree conjecture. These results suggest the existence of a wide statistical-to-computational gap in this problem. Numerical experiments show that neural networks learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features do no better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap between the amounts of data that neural networks and random features require to learn from higher-order cumulants.
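To make the setting concrete, the following is a minimal sketch of inputs whose mean and covariance match an isotropic Gaussian, so that the spike is visible only in cumulants of order four and higher. It is a simplified stand-in and not necessarily the exact spiked cumulant model analysed here: the Rademacher latent variable, the construction that swaps out the Gaussian component along the spike, and the names `sample_spiked` and `excess_kurtosis` are all assumptions made for illustration.

```python
import math
import random


def sample_spiked(n, u, rng):
    """Draw n inputs with mean 0 and identity covariance whose projection
    onto the unit vector u is Rademacher (+1/-1) rather than Gaussian.
    Illustrative construction only (an assumption, not the paper's model)."""
    d = len(u)
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        proj = sum(zi * ui for zi, ui in zip(z, u))
        g = rng.choice((-1.0, 1.0))  # non-Gaussian latent: mean 0, variance 1
        # Replace the Gaussian component along u by g; covariance stays I.
        samples.append([zi + (g - proj) * ui for zi, ui in zip(z, u)])
    return samples


def excess_kurtosis(samples, v):
    """Empirical fourth cumulant of the projection onto v, divided by the
    squared variance. It vanishes in expectation for Gaussian projections."""
    p = [sum(xi * vi for xi, vi in zip(x, v)) for x in samples]
    m = sum(p) / len(p)
    m2 = sum((t - m) ** 2 for t in p) / len(p)
    m4 = sum((t - m) ** 4 for t in p) / len(p)
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)
d, n = 8, 20000
u = [1.0 / math.sqrt(d)] * d                                   # hypothetical spike
v = [1.0 / math.sqrt(2), -1.0 / math.sqrt(2)] + [0.0] * (d - 2)  # orthogonal to u

xs = sample_spiked(n, u, rng)
k_spike = excess_kurtosis(xs, u)  # strongly negative: Rademacher projection
k_other = excess_kurtosis(xs, v)  # near zero: Gaussian projection
print(k_spike, k_other)
```

In this toy construction the excess kurtosis along `u` is close to -2, while along the orthogonal direction `v` it is close to zero, even though both projections have unit variance. A test based on second moments alone therefore cannot tell these inputs from isotropic Gaussian ones.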