Although the Multi-Armed Bandit (MAB) framework on the one hand and the policy gradient approach on the other are among the most widely used frameworks of Reinforcement Learning, the theoretical properties of the policy gradient algorithm applied to MAB problems have not received enough attention. In this work we investigate the convergence of such a procedure when an $L^2$ regularization term is present jointly with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve on the canonical approach, especially when the initial guess is far from the solution.
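For concreteness, a minimal sketch of the kind of regularized softmax policy gradient scheme described above (illustrative notation only, not necessarily the exact formulation used in the body of the paper: $K$ arms with expected rewards $r_a$, parameter $\theta \in \mathbb{R}^K$, regularization weight $\lambda > 0$, learning rate $\eta > 0$):
\begin{align*}
  \pi_\theta(a) &= \frac{e^{\theta_a}}{\sum_{b=1}^{K} e^{\theta_b}}, \qquad a = 1,\dots,K,\\
  J_\lambda(\theta) &= \sum_{a=1}^{K} \pi_\theta(a)\, r_a \;-\; \lambda \,\|\theta\|_2^2,\\
  \theta_{t+1} &= \theta_t + \eta\, \widehat{\nabla_\theta J_\lambda}(\theta_t),
\end{align*}
where $\widehat{\nabla_\theta J_\lambda}$ denotes a gradient estimate built from the observed bandit rewards, and a time-dependent variant would replace $\lambda$ by a schedule $\lambda_t$.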