We develop a novel framework that adds sparse group lasso regularizers to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, and AdaHessian, creating a new class of optimizers named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, and Group AdaHessian, respectively. Based on primal-dual methods, we establish convergence guarantees in the stochastic convex setting. We evaluate the regularization effect of the new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results show that, compared with the original optimizers followed by magnitude pruning as a post-processing step, our optimizers significantly improve model performance at the same sparsity level. Furthermore, compared with the original optimizers without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance. The code is available at https://github.com/intelligent-machine-learning/dlrover/blob/master/tfplus.
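For readers unfamiliar with the sparse group lasso, the sketch below may help illustrate the kind of regularizer involved. It assumes the common form of the penalty, λ1·‖w‖₁ + λ2·Σ_g √(d_g)·‖w_g‖₂, and shows its standard proximal operator (elementwise soft-thresholding followed by group-wise shrinkage). This is not the primal-dual update used by the Group optimizers, whose exact form is not given in the abstract; the function names and the parameters `lam1`, `lam2`, and `step` are hypothetical and chosen for illustration only.

```python
import math


def soft_threshold(x, t):
    """Shrink a scalar toward zero by t (the prox of t * |x|)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def sparse_group_lasso_penalty(groups, lam1, lam2):
    """Evaluate lam1 * ||w||_1 + lam2 * sum_g sqrt(d_g) * ||w_g||_2.

    `groups` is a list of weight groups, each a list of floats
    (for example, one group per embedding vector).
    """
    total = 0.0
    for g in groups:
        l1 = sum(abs(v) for v in g)
        l2 = math.sqrt(sum(v * v for v in g))
        total += lam1 * l1 + lam2 * math.sqrt(len(g)) * l2
    return total


def prox_sparse_group_lasso(groups, lam1, lam2, step):
    """Apply the proximal operator of the penalty above with step size `step`.

    Step 1: elementwise soft-thresholding handles the l1 term.
    Step 2: group-wise shrinkage handles the group l2 term and can
            set an entire group to zero.
    """
    out = []
    for g in groups:
        shrunk = [soft_threshold(v, step * lam1) for v in g]
        norm = math.sqrt(sum(v * v for v in shrunk))
        thresh = step * lam2 * math.sqrt(len(g))
        if norm <= thresh:
            out.append([0.0] * len(g))  # the whole group is pruned
        else:
            scale = 1.0 - thresh / norm
            out.append([scale * v for v in shrunk])
    return out


if __name__ == "__main__":
    weights = [[0.05, -0.02], [3.0, -4.0]]  # two hypothetical groups
    updated = prox_sparse_group_lasso(weights, lam1=0.1, lam2=0.1, step=1.0)
    print(updated)  # the small first group is zeroed; the second is only shrunk
```

The group term is what produces structured sparsity: a group whose norm falls below its threshold is removed as a whole, whereas magnitude pruning applied after training zeroes weights without the optimizer having accounted for it.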