Multi-armed bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal policy and serves as the foundation for downstream algorithms, including REINFORCE. In contrast to vanilla approaches, we consider here the L2-regularized softmax policy gradient, where a quadratic term is subtracted from the mean reward. Previous studies exploiting convexity failed to identify a suitable theoretical framework for analyzing its convergence when the regularization parameter vanishes. We prove theoretical convergence results and confirm empirically that this regime makes the L2 regularization numerically advantageous on standard benchmarks.
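To make the regularized objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the quadratic term is the squared L2 norm of the softmax preference vector, scaled by `gamma / 2`, which the abstract does not specify. The names `h`, `q`, `gamma`, and `step` are hypothetical, and the sketch uses the exact expected gradient with known arm means rather than the sampled stochastic gradient used in practice.

```python
import numpy as np


def softmax(h):
    """Numerically stable softmax over the preference vector h."""
    z = np.exp(h - np.max(h))
    return z / z.sum()


def regularized_objective(h, q, gamma):
    """Mean reward under softmax(h) minus the assumed penalty (gamma/2)*||h||^2."""
    return softmax(h) @ q - 0.5 * gamma * np.dot(h, h)


def regularized_gradient(h, q, gamma):
    """Exact gradient of regularized_objective with respect to h."""
    pi = softmax(h)
    # Softmax policy gradient term pi_a * (q_a - pi.q), minus the penalty gradient.
    return pi * (q - pi @ q) - gamma * h


def gradient_ascent(q, gamma, step=0.5, n_steps=2000):
    """Plain gradient ascent on the regularized objective, starting from h = 0."""
    h = np.zeros_like(q, dtype=float)
    for _ in range(n_steps):
        h = h + step * regularized_gradient(h, q, gamma)
    return h


# Hypothetical three-armed bandit with known mean rewards (illustration only).
q = np.array([0.2, 0.5, 0.9])
h = gradient_ascent(q, gamma=0.01)
pi = softmax(h)
print("preferences:", h)
print("policy:", pi)
```

Under this assumption the penalty keeps the preferences bounded, so the policy concentrates on the best arm without the preferences diverging. Shrinking `gamma` toward zero corresponds to the vanishing-regularization regime discussed in the abstract.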