Like many optimization algorithms, Stochastic Variational Inference (SVI) is sensitive to the choice of the learning rate. If the learning rate is too small, the optimization process may be slow and the algorithm might get stuck in local optima. If, on the other hand, the learning rate is too large, the algorithm may oscillate or diverge and fail to converge to a solution. Adaptive learning rate methods such as Adam, AdaMax, Adagrad, or RMSprop automatically adjust the learning rate based on the history of gradients. Nevertheless, if the base learning rate is too large, the variational parameters can still oscillate around the optimal solution. Learning rate schedules mitigate this problem by gradually reducing the learning rate; however, the amount by which the learning rate should decrease in each iteration is not known a priori, and this choice can significantly affect optimization performance. In this work, we propose a method that decays the learning rate based on the history of the variational parameters. To adapt the learning rate, we use an empirical measure that quantifies how much the variational parameters oscillate relative to the progress they make. The approach requires little memory and is computationally efficient. We demonstrate on various numerical examples that our method reduces the sensitivity of the optimization performance to the learning rate and that it can be combined with other adaptive learning rate methods.
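To illustrate the kind of criterion the abstract describes, here is a minimal sketch that decays the learning rate of plain SGD whenever the recent parameter updates mostly cancel out. The specific measure used below (net displacement over a window divided by the total path length of the updates), as well as the window size, threshold, and decay factor, are hypothetical illustrations, not the paper's actual quantities.

```python
import numpy as np

def sgd_with_oscillation_decay(grad, theta0, lr0=0.1, n_steps=1000,
                               window=20, threshold=0.2, decay=0.5):
    """SGD whose learning rate decays when the iterate oscillates.

    Hypothetical sketch: every `window` steps, compare the net
    displacement of the parameters to the total path length of the
    updates. If the updates mostly cancel out (ratio below
    `threshold`), the iterate is oscillating around a point, so the
    learning rate is shrunk by `decay`. Only two running accumulators
    are kept, so the memory overhead is small.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    lr = lr0
    disp = np.zeros_like(theta)  # net displacement within the window
    path = 0.0                   # total length of updates in the window
    for t in range(1, n_steps + 1):
        step = -lr * grad(theta)
        theta += step
        disp += step
        path += np.linalg.norm(step)
        if t % window == 0:
            # Oscillation: steps cancel, net displacement << path length.
            if np.linalg.norm(disp) < threshold * path:
                lr *= decay
            disp[:] = 0.0
            path = 0.0
    return theta, lr

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
# With lr0 = 1.9, plain SGD oscillates around 0; the decay fixes this.
theta, lr = sgd_with_oscillation_decay(lambda x: x, [5.0, -3.0], lr0=1.9)
```

The displacement-to-path-length ratio is one simple way to separate oscillation (updates cancel, ratio near zero) from steady progress (updates align, ratio near one); under this assumed criterion, the same accumulators could wrap the per-coordinate steps produced by Adam or RMSprop to scale their base learning rate.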