We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). While the PG objective is non-concave, recent research has used the objective's smoothness and gradient domination properties to achieve convergence to an optimal policy. However, these theoretical results require setting the algorithm parameters according to unknown problem-dependent quantities (e.g., the optimal action or the true reward vector in a bandit problem). To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings. In the exact setting, we employ an Armijo line-search to set the step-size for softmax PG and demonstrate a linear convergence rate. In the stochastic setting, we utilize exponentially decreasing step-sizes, and characterize the convergence rate of the resulting algorithm. We show that the proposed algorithm offers theoretical guarantees similar to the state-of-the-art results, but does not require knowledge of oracle-like quantities. For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, knowledge of the reward gap, the reward distributions, or the noise. Finally, we empirically compare the proposed methods to PG approaches that require oracle knowledge, and demonstrate competitive performance.
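To make the exact-setting idea concrete, the following is a minimal sketch of softmax PG with an Armijo (backtracking) line-search on a small bandit. The reward vector, the backtracking constants, and the number of iterations are illustrative assumptions, not the paper's prescribed values; the sketch only shows the mechanism of choosing the step-size by backtracking until a sufficient-increase condition holds.

```python
import numpy as np

# Hypothetical 3-armed bandit reward vector (illustration only).
r = np.array([1.0, 0.8, 0.2])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def objective(theta):
    # Expected reward of the softmax policy (the exact PG objective for a bandit).
    return softmax(theta) @ r

def gradient(theta):
    pi = softmax(theta)
    return pi * (r - pi @ r)

def armijo_step(theta, eta_max=10.0, beta=0.5, c=0.5):
    """Backtrack the step-size until the Armijo sufficient-increase
    condition f(theta + eta*g) >= f(theta) + c*eta*||g||^2 holds."""
    g = gradient(theta)
    f0 = objective(theta)
    eta = eta_max
    while objective(theta + eta * g) < f0 + c * eta * (g @ g):
        eta *= beta
    return theta + eta * g

theta = np.zeros(3)
for _ in range(200):
    theta = armijo_step(theta)
print(softmax(theta))  # policy concentrates on the best arm
```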
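For the stochastic setting, the sketch below runs softmax PG on a noisy bandit with a REINFORCE (score-function) gradient estimate and an exponentially decreasing step-size. The geometric schedule eta0 * alpha**t, the noise model, and all constants are assumptions chosen for illustration; the paper's actual schedule and constants may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: true mean rewards (unknown to the algorithm) plus Gaussian noise.
true_means = np.array([1.0, 0.8, 0.2])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

T = 5000
theta = np.zeros(3)
eta0, alpha = 2.0, 0.999  # illustrative step-size schedule parameters

for t in range(T):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    reward = true_means[a] + 0.1 * rng.standard_normal()  # noisy observed reward
    # Score-function estimate of the gradient: reward * grad log pi(a).
    grad_est = reward * (np.eye(3)[a] - pi)
    eta_t = eta0 * alpha ** t  # exponentially decreasing step-size
    theta += eta_t * grad_est

print(softmax(theta))  # probability mass should concentrate on the best arm
```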