Optimizing deep neural networks is one of the main tasks in successful deep learning. Current state-of-the-art optimizers are adaptive gradient-based methods such as Adam. Recently, there has been increasing interest in formulating gradient-based optimizers in a probabilistic framework to better estimate gradients and model uncertainty. Here, we propose to combine both approaches, resulting in the Variational Stochastic Gradient Descent (VSGD) optimizer. We cast gradient updates as a probabilistic model and use stochastic variational inference (SVI) to derive an efficient and effective update rule. Further, we show how VSGD relates to other adaptive gradient-based optimizers such as Adam. Lastly, we carry out experiments on two image classification datasets and four deep neural network architectures, where we show that VSGD outperforms Adam and SGD.