We propose a Regularized Adaptive Momentum Dual Averaging (RAMDA) algorithm for training structured neural networks. Similar to existing regularized adaptive methods, the subproblem for computing the update direction of RAMDA involves a nonsmooth regularizer and a diagonal preconditioner, and therefore does not possess a closed-form solution in general. We therefore devise an implementable inexactness condition that retains convergence guarantees matching those of the exact versions, and propose an efficient companion solver for the subproblems of both RAMDA and existing methods to make them practically feasible. We leverage the theory of manifold identification in variational analysis to show that, even in the presence of such inexactness, the iterates of RAMDA attain the ideal structure induced by the regularizer at the stationary point to which they converge. This structure is locally optimal near the point of convergence, so RAMDA is guaranteed to obtain the best structure possible among all methods converging to the same point, making it the first regularized adaptive method that outputs models which are (locally) optimally structured while retaining outstanding predictive performance. Extensive numerical experiments on large-scale modern computer vision, language modeling, and speech tasks show that the proposed RAMDA is efficient and consistently outperforms state-of-the-art methods for training structured neural networks. An implementation of our algorithm is available at http://www.github.com/ismoptgroup/RAMDA/.
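To make the subproblem concrete, the sketch below illustrates the kind of step such regularized adaptive methods must compute: minimizing a diagonally preconditioned quadratic model plus a nonsmooth group-lasso regularizer, solved inexactly by proximal-gradient iterations because no closed-form solution exists. This is a minimal illustration, not the authors' implementation; all names (g, d, w_ref, lam, alpha) and the simple step-size-based stopping rule standing in for the paper's inexactness condition are our assumptions.

```python
import numpy as np

def prox_group_lasso(w, groups, tau):
    """Closed-form prox of tau * sum_g ||w_g||_2 (group soft-thresholding)."""
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        out[idx] = 0.0 if norm <= tau else (1.0 - tau / norm) * w[idx]
    return out

def solve_subproblem(g, d, w_ref, groups, lam, alpha, tol=1e-6, max_iter=100):
    """Inexactly minimize
        <g, w> + (1/(2*alpha)) * sum(d * (w - w_ref)**2) + lam * sum_g ||w_g||_2
    by proximal gradient.  The smooth part has a d.max()/alpha-Lipschitz
    gradient, so a fixed step of alpha/d.max() is safe.  The small-step
    test below is a hypothetical stand-in for an implementable
    inexactness condition."""
    w = w_ref.copy()
    eta = alpha / d.max()  # 1 / Lipschitz constant of the smooth part
    for _ in range(max_iter):
        grad = g + (d / alpha) * (w - w_ref)  # gradient of the smooth part
        w_new = prox_group_lasso(w - eta * grad, groups, eta * lam)
        if np.linalg.norm(w_new - w) <= tol * max(1.0, np.linalg.norm(w)):
            break
        w = w_new
    return w_new
```

Because the group-lasso prox has a closed form, each inner iteration is cheap even though the overall subproblem, with a non-uniform diagonal preconditioner d, does not admit a closed-form solution; this is what makes an inexact inner solver with a verifiable stopping condition attractive in practice.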