Adam is the de facto optimization algorithm for many deep learning applications, but our understanding of its implicit bias, and of how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks (NNs) trained with SGD are known to exhibit simplicity bias -- a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. First, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU NNs on a binary classification task with Gaussian data. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam learns much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes-optimal predictor. This richer decision boundary also allows Adam to achieve higher test accuracy both in-distribution and under certain distribution shifts. We prove these results theoretically by analyzing the population gradients. Next, to corroborate our theoretical findings, we present extensive empirical results showing that this property of Adam leads to superior generalization on a variety of datasets with spurious correlations, settings in which NNs trained with SGD are known to exhibit simplicity bias and to generalize poorly under distribution shift.
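To make the described setup concrete, below is a minimal sketch (not the authors' code) of the kind of comparison outlined above: a two-layer ReLU network trained with full-batch GD versus Adam on a synthetic Gaussian-mixture binary classification task whose Bayes-optimal boundary is nonlinear. The data model (an XOR-like mixture), network width, learning rate, and step count are illustrative assumptions rather than values taken from the paper.

```python
# Sketch: compare full-batch GD (SGD without minibatching) and Adam on a
# synthetic Gaussian-mixture task. All hyperparameters are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n, d=20, a=2.0):
    # Hypothetical XOR-like Gaussian mixture in the first two coordinates
    # (remaining coordinates are noise), so the Bayes-optimal boundary is
    # nonlinear. This data model is assumed for illustration only.
    signs = torch.randint(0, 2, (n, 2)) * 2 - 1      # cluster centres at (+-a, +-a)
    x = torch.randn(n, d)
    x[:, :2] += a * signs
    y = (signs[:, 0] * signs[:, 1] > 0).float()      # XOR labelling of the clusters
    return x, y

def train(optimizer_name, width=100, steps=2000, lr=1e-2):
    # Two-layer ReLU network with a scalar logit output.
    model = nn.Sequential(nn.Linear(20, width), nn.ReLU(), nn.Linear(width, 1))
    opt = (torch.optim.Adam(model.parameters(), lr=lr) if optimizer_name == "adam"
           else torch.optim.SGD(model.parameters(), lr=lr))
    x_tr, y_tr = make_data(1000)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):                           # full-batch updates
        opt.zero_grad()
        loss = loss_fn(model(x_tr).squeeze(1), y_tr)
        loss.backward()
        opt.step()
    x_te, y_te = make_data(5000)
    with torch.no_grad():
        acc = ((model(x_te).squeeze(1) > 0).float() == y_te).float().mean().item()
    print(f"{optimizer_name:>4s}: test accuracy = {acc:.3f}")

for name in ("sgd", "adam"):
    train(name)
```

In this kind of toy setting, one can also visualize the learned decision boundary in the first two coordinates to see whether a given optimizer ends up with a roughly linear separator or a nonlinear one; the script above only reports test accuracy.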