Because training Deep Neural Network (DNN) models is a non-convex problem, their effectiveness relies on non-convex optimization heuristics. Traditional training methods for DNNs often require costly empirical procedures to produce successful models and lack a clear theoretical foundation. In this study, we examine the use of convex optimization theory and sparse recovery models to refine the training process of neural networks and to provide a better interpretation of their optimal weights. We focus on training two-layer neural networks with piecewise linear activations and demonstrate that the training problem can be formulated as a finite-dimensional convex program. These programs include a sparsity-promoting regularization term that constitutes a variant of group Lasso. We first use semi-infinite programming theory to prove strong duality for finite-width neural networks, and then express these architectures equivalently as high-dimensional convex sparse recovery models. Remarkably, the worst-case complexity of solving the convex program is polynomial in the number of samples and the number of neurons when the rank of the data matrix is bounded, which is the case in convolutional networks. To extend our method to training data of arbitrary rank, we develop a novel polynomial-time approximation scheme based on zonotope subsampling that comes with a guaranteed approximation ratio. We also show that every stationary point of the non-convex training objective can be characterized as the global optimum of a subsampled convex program. Unlike non-convex methods, our convex models can be trained with standard convex solvers without resorting to heuristics or extensive hyperparameter tuning. Through extensive numerical experiments, we show that convex models can outperform traditional non-convex methods and are not sensitive to optimizer hyperparameters.
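To make the role of bounded rank concrete, the following is a minimal sketch, not the paper's algorithm. It assumes ReLU activations, for which a neuron with weight vector g induces the activation pattern 1[Xg >= 0] on the data matrix X, and it collects distinct patterns by drawing random Gaussian directions. The function names `sample_activation_patterns` and `max_patterns`, the toy data, and the sampling strategy are all illustrative assumptions. The count of distinct patterns is bounded by the classical region count for n central hyperplanes in rank r, which is polynomial in n for fixed r.

```python
import math
import random


def sample_activation_patterns(X, num_samples, seed=0):
    """Collect distinct ReLU activation patterns 1[X g >= 0] over random Gaussian g.

    X is a list of n rows, each a list of d floats (hypothetical toy format).
    """
    rng = random.Random(seed)
    d = len(X[0])
    patterns = set()
    for _ in range(num_samples):
        # Draw a random direction; only its sign pattern on the data matters.
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        pattern = tuple(
            int(sum(x_j * g_j for x_j, g_j in zip(x, g)) >= 0) for x in X
        )
        patterns.add(pattern)
    return patterns


def max_patterns(n, r):
    """Upper bound on the number of regions cut out by n central hyperplanes in R^r."""
    return 2 * sum(math.comb(n - 1, k) for k in range(r))


# Toy data: n = 5 samples in d = 2 dimensions, so the rank is at most 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]
patterns = sample_activation_patterns(X, num_samples=2000)
bound = max_patterns(n=len(X), r=2)
print(len(patterns), "distinct patterns sampled; upper bound is", bound)
```

Random sampling only recovers a subset of the patterns in general, so a convex program built on sampled patterns is an approximation unless every pattern has been found.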