We present a self-contained proof of the convergence rate of stochastic gradient descent (SGD) when the learning rate follows an inverse time decay schedule; we then apply the results to the convergence of a modified form of the policy gradient method for the multi-armed bandit (MAB) problem with $L2$ regularization.