Although the Multi-Armed Bandit (MAB) framework on the one hand and the policy gradient approach on the other are among the most widely used tools of Reinforcement Learning, the theoretical properties of policy gradient algorithms applied to MAB have not received enough attention. In this work we investigate the convergence of such a procedure when an $L^2$ regularization term is present jointly with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve over the canonical approach, especially when the initial guess is far from the solution.
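As an illustration of the procedure the abstract describes, the following is a minimal sketch of softmax policy gradient on a Gaussian bandit with an $L^2$ penalty on the parameters. All specifics here (step size `eta`, regularization weight `lam`, arm means, horizon) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax parametrization of the policy.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def regularized_pg_bandit(true_means, steps=20000, eta=0.1, lam=0.01, seed=0):
    """Softmax policy gradient on a Gaussian-reward bandit with an
    L2 regularization term on theta (hyperparameters are illustrative)."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    theta = np.zeros(K)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)                   # sample an arm from the policy
        r = true_means[a] + rng.normal()          # noisy reward
        grad_log = -pi
        grad_log[a] += 1.0                        # grad of log pi(a) w.r.t. theta
        # Gradient ascent on expected reward minus (lam/2) * ||theta||^2.
        theta += eta * (r * grad_log - lam * theta)
    return softmax(theta)

pi = regularized_pg_bandit(np.array([0.2, 0.5, 0.9]))
```

The `- lam * theta` term is the gradient of the $L^2$ penalty; it keeps the parameters bounded, so the policy stays stochastic instead of collapsing prematurely onto one arm. A time-dependent variant, as mentioned in the abstract, would let `lam` decay with the iteration count.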