We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). While the PG objective is non-concave, recent research has used the objective's smoothness and gradient domination properties to achieve convergence to an optimal policy. However, these theoretical results require setting the algorithm parameters according to unknown problem-dependent quantities (e.g., the optimal action or the true reward vector in a bandit problem). To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings. In the exact setting, we employ an Armijo line-search to set the step-size for softmax PG and demonstrate a linear convergence rate. In the stochastic setting, we utilize exponentially decreasing step-sizes, and characterize the convergence rate of the resulting algorithm. We show that the proposed algorithm offers theoretical guarantees similar to the state-of-the-art results, but does not require knowledge of oracle-like quantities. For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, knowledge of the reward gap, the reward distributions, or the noise. Finally, we empirically compare the proposed methods to PG approaches that require oracle knowledge, and demonstrate competitive performance.
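To make the exact-setting idea concrete, the following is a minimal sketch of softmax PG with an Armijo (backtracking) line-search on a small bandit. The reward vector, the backtracking constants, and the number of iterations are illustrative assumptions, not the paper's prescribed values; the sketch only shows the mechanism of choosing the step-size by backtracking until a sufficient-increase condition holds.

```python
import numpy as np

# Hypothetical 3-armed bandit reward vector (illustration only).
r = np.array([1.0, 0.8, 0.2])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def objective(theta):
    # Expected reward of the softmax policy (the exact PG objective for a bandit).
    return softmax(theta) @ r

def gradient(theta):
    pi = softmax(theta)
    return pi * (r - pi @ r)

def armijo_step(theta, eta_max=10.0, beta=0.5, c=0.5):
    """Backtrack the step-size until the Armijo sufficient-increase
    condition f(theta + eta*g) >= f(theta) + c*eta*||g||^2 holds."""
    g = gradient(theta)
    f0 = objective(theta)
    eta = eta_max
    while objective(theta + eta * g) < f0 + c * eta * (g @ g):
        eta *= beta
    return theta + eta * g

theta = np.zeros(3)
for _ in range(200):
    theta = armijo_step(theta)
print(softmax(theta))  # policy concentrates on the best arm
```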
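For the stochastic setting, the sketch below runs softmax PG on a noisy bandit with a REINFORCE (score-function) gradient estimate and an exponentially decreasing step-size. The geometric schedule eta0 * alpha**t, the noise model, and all constants are assumptions chosen for illustration; the paper's actual schedule and constants may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: true mean rewards (unknown to the algorithm) plus Gaussian noise.
true_means = np.array([1.0, 0.8, 0.2])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

T = 5000
theta = np.zeros(3)
eta0, alpha = 2.0, 0.999  # illustrative step-size schedule parameters

for t in range(T):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    reward = true_means[a] + 0.1 * rng.standard_normal()  # noisy observed reward
    # Score-function estimate of the gradient: reward * grad log pi(a).
    grad_est = reward * (np.eye(3)[a] - pi)
    eta_t = eta0 * alpha ** t  # exponentially decreasing step-size
    theta += eta_t * grad_est

print(softmax(theta))  # probability mass should concentrate on the best arm
```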