How the choice of optimization algorithm affects the generalization performance of deep neural networks is a major concern in machine learning, and this performance depends on several factors. In this paper, we theoretically prove that the Lipschitz constant of the loss function is an important factor in reducing the generalization error of the output model obtained by Adam or AdamW. The results can serve as a guideline for choosing the loss function when the optimization algorithm is Adam or AdamW. In addition, to evaluate the theoretical bound in a practical setting, we choose the human age estimation problem in computer vision. To assess generalization more rigorously, the training and test datasets are drawn from different distributions. Our experimental evaluation shows that a loss function with a lower Lipschitz constant and a lower maximum value improves the generalization of the model trained by Adam or AdamW.
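To make the two quantities named above concrete, the following minimal sketch numerically estimates the Lipschitz constant and the maximum value of two candidate losses on a bounded output range. The absolute and squared errors, the age range [0, 100], and the helper `estimate_lipschitz_and_max` are illustrative assumptions, not the losses or the procedure used in the paper.

```python
# Illustrative sketch only: the losses, the [0, 100] age range, and the helper
# name are assumptions made for this example, not taken from the paper.

def l1_loss(pred, target):
    """Absolute error between a predicted and a true age."""
    return abs(pred - target)


def l2_loss(pred, target):
    """Squared error between a predicted and a true age."""
    return (pred - target) ** 2


def estimate_lipschitz_and_max(loss, lo=0.0, hi=100.0, steps=200):
    """Estimate the Lipschitz constant of `loss` with respect to the
    prediction, and its maximum value, on a uniform grid over [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    lipschitz = 0.0
    max_value = 0.0
    for target in grid:
        # Largest finite-difference slope between neighbouring predictions.
        for a, b in zip(grid, grid[1:]):
            slope = abs(loss(b, target) - loss(a, target)) / (b - a)
            lipschitz = max(lipschitz, slope)
        # Largest loss value attained for this target.
        max_value = max(max_value, max(loss(p, target) for p in grid))
    return lipschitz, max_value


l1_stats = estimate_lipschitz_and_max(l1_loss)
l2_stats = estimate_lipschitz_and_max(l2_loss)
print("L1 (Lipschitz estimate, max value):", l1_stats)
print("L2 (Lipschitz estimate, max value):", l2_stats)
```

On this assumed range the absolute error has a far smaller Lipschitz constant and maximum value than the squared error, which is the kind of difference the abstract associates with better generalization under Adam or AdamW. The finite-difference estimate is a lower bound on the true constant and tightens as `steps` grows.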