Numerous research questions about deep learning (DL) remain open within the classical learning theory framework: the remarkable generalization of overparametrized neural networks (NNs), the efficiency of optimization despite non-convex objectives, the role of flat minima in generalization, and the exceptional performance of deep architectures, among others. This paper introduces a novel theoretical learning framework, General Distribution Learning (GD Learning), designed to address a broad range of machine learning and statistical tasks, including classification, regression, and parameter estimation. Departing from statistical machine learning, GD Learning focuses on the true underlying distribution. In GD Learning, the learning error, which corresponds to the expected error in the classical statistical learning framework, is decomposed into fitting error, caused by the model and the fitting algorithm, and sampling error, introduced by limited sample data. The framework also incorporates prior knowledge, especially in scenarios of data scarcity; this integration of external knowledge helps to minimize the learning error over the entire dataset and thereby improves performance. Within the GD Learning framework, we show that the global optimum of non-convex optimization problems, such as minimizing the fitting error, can be approached by minimizing the gradient norm and the non-uniformity of the eigenvalues of the model's Jacobian matrix. This insight leads to the gradient structure control algorithm. GD Learning also offers a fresh perspective on open questions in deep learning, including overparameterization, non-convex optimization, the bias-variance trade-off, and the mechanism of flat minima.
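To make the gradient structure control idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the paper's implementation. It combines the squared gradient norm of the loss with a non-uniformity measure on the spectrum of a model Jacobian; the function name, the weights alpha and beta, the choice of Jacobian (output with respect to the input of a single sample), and the use of the variance of normalized singular values as the non-uniformity measure are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy model and data for demonstration only.
model = nn.Linear(4, 3)
x = torch.randn(8, 4)
y = torch.randn(8, 3)

def gradient_structure_penalty(loss, params, jacobian, alpha=1.0, beta=1.0):
    """Hypothetical sketch of a gradient-structure-control style penalty:
    squared gradient norm of the loss plus a non-uniformity measure
    on the Jacobian spectrum (variance of the normalized singular values)."""
    # (i) squared norm of the loss gradient w.r.t. the parameters;
    # create_graph=True keeps the penalty differentiable for training.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm_sq = sum(g.pow(2).sum() for g in grads)

    # (ii) spectrum non-uniformity: singular values serve as a stand-in
    # for eigenvalues, since the Jacobian need not be square.
    s = torch.linalg.svdvals(jacobian)
    s = s / s.sum()
    non_uniformity = s.var(unbiased=False)  # zero for a uniform spectrum

    return alpha * grad_norm_sq + beta * non_uniformity

# Jacobian of the model output w.r.t. the input for one sample
# (an illustrative choice; the paper's Jacobian may be defined differently).
jac = torch.autograd.functional.jacobian(model, x[0], create_graph=True)

loss = nn.functional.mse_loss(model(x), y)
penalty = gradient_structure_penalty(loss, list(model.parameters()), jac)
(loss + penalty).backward()
```

Under these assumptions, driving both terms toward zero pushes the iterate toward a stationary point with a uniform Jacobian spectrum, which is the condition the abstract associates with approaching the global optimum of the fitting error.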