In this work, we propose a notion of practical learnability grounded in finite-sample settings and develop a conjugate learning-theoretic framework, based on convex conjugate duality, to characterize this learnability property. Building on this foundation, we show that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) attains global optima of the empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound on the achievable empirical risk, demonstrating theoretically that the data determine the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on the generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of the generalization error, while the latter characterizes its distribution relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling. These bounds explicitly quantify the influence of three key factors: (i) the information loss induced by irreversible transformations in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of the features with respect to the labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of DNNs. Extensive experiments validate all theoretical predictions, confirming the framework's correctness and consistency.
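For reference, the duality underpinning the framework is the standard convex (Fenchel) conjugate; the specific structure matrix and the conjugate constructions used in our theorems are developed in the main text. A minimal statement, assuming $f$ is a proper convex function on $\mathbb{R}^{n}$:
\[
f^{*}(y) \;=\; \sup_{x \in \mathbb{R}^{n}} \bigl( \langle x, y \rangle - f(x) \bigr),
\qquad
f^{**} = f \quad \text{whenever } f \text{ is also lower semicontinuous (Fenchel--Moreau)}.
\]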