Backpropagation with gradient descent is the optimization strategy used by most neural network architectures in machine learning. Finding optimal hyperparameters to guide training, however, has proven challenging. Although it is widely acknowledged that selecting appropriate parameters is crucial for avoiding overfitting and achieving unbiased outcomes, the choice still rests largely on empirical experimentation and practitioner experience. This paper presents a new probabilistic framework for the learning rate, a key parameter in stochastic gradient descent. The framework extends classical Bayesian statistics into a double-Bayesian decision mechanism involving two antagonistic Bayesian processes. From these two processes, a theoretically optimal learning rate can be derived and used for stochastic gradient descent. Experiments across various classification, segmentation, and detection tasks corroborate the practical significance of the theoretically derived learning rate. The paper also discusses the implications of the proposed double-Bayesian framework for network training and model performance.