In this work, we propose an information-directed objective for infinite-horizon reinforcement learning (RL), called the occupancy information ratio (OIR). It is inspired by the information ratio objectives used in previous information-directed sampling schemes for multi-armed bandits and Markov decision processes, as well as by recent advances in general utility RL. The OIR, defined as the ratio between the average cost of a policy and the entropy of its induced state occupancy measure, enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply. Specifically, by leveraging connections between quasiconcave optimization and the linear programming theory for Markov decision processes, we show that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known. Since model knowledge is typically lacking in practice, we lay the foundations for model-free OIR policy search methods by establishing a corresponding policy gradient theorem. Building on this result, we derive REINFORCE- and actor-critic-style algorithms for solving the OIR problem in policy parameter space. Crucially, by exploiting the hidden quasiconcavity property implied by the concave programming transformation of the OIR problem, we establish finite-time convergence of the REINFORCE-style scheme to global optimality and asymptotic convergence of the actor-critic-style scheme to (near) global optimality under suitable conditions. Finally, we experimentally illustrate the advantages of OIR-based methods over vanilla methods in sparse-reward settings, supporting the OIR as an alternative to existing RL objectives.
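To make the definition concrete, the following is a minimal sketch of how the OIR could be evaluated for a fixed policy when the model is known. Everything in it is an illustrative assumption rather than the paper's method: the two-state, two-action MDP, the array layouts `P[s][a][t]`, `c[s][a]`, and `pi[s][a]`, the function names, and the use of a lazy power iteration to approximate the stationary state distribution. It also assumes that the occupancy measure is the long-run state distribution of an ergodic chain and that costs are positive, so that a smaller ratio is preferable.

```python
import math


def state_occupancy(P, pi, iters=2000):
    """Approximate the stationary state distribution induced by policy pi.

    P[s][a][t]: probability of moving from state s to state t under action a.
    pi[s][a]:   probability of choosing action a in state s.
    (Both layouts are hypothetical choices made for this sketch.)
    """
    n = len(P)
    # State-to-state transition matrix of the Markov chain induced by pi.
    P_pi = [
        [sum(pi[s][a] * P[s][a][t] for a in range(len(pi[s]))) for t in range(n)]
        for s in range(n)
    ]
    d = [1.0 / n] * n
    for _ in range(iters):
        stepped = [sum(d[s] * P_pi[s][t] for s in range(n)) for t in range(n)]
        # Lazy update: same stationary distribution, but avoids periodicity issues.
        d = [0.5 * d[t] + 0.5 * stepped[t] for t in range(n)]
    return d


def occupancy_information_ratio(P, c, pi):
    """Average cost of pi divided by the entropy of its state occupancy."""
    d = state_occupancy(P, pi)
    avg_cost = sum(
        d[s] * pi[s][a] * c[s][a]
        for s in range(len(d))
        for a in range(len(pi[s]))
    )
    entropy = -sum(p * math.log(p) for p in d if p > 0.0)
    return avg_cost / entropy


# Hypothetical MDP: action 0 steers toward state 0, action 1 toward state 1,
# regardless of the current state.
P = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.9, 0.1], [0.1, 0.9]],
]
# Hypothetical costs: being in state 0 costs 1, being in state 1 costs 2.
c = [[1.0, 1.0], [2.0, 2.0]]

pi_uniform = [[0.5, 0.5], [0.5, 0.5]]  # explores both states equally
pi_greedy = [[1.0, 0.0], [1.0, 0.0]]   # always steers toward the cheap state

oir_uniform = occupancy_information_ratio(P, c, pi_uniform)
oir_greedy = occupancy_information_ratio(P, c, pi_greedy)
print(oir_uniform, oir_greedy)
```

In this toy instance the cost-greedy policy has the lower average cost, but its occupancy measure concentrates on one state and therefore has low entropy, so its ratio is larger than that of the uniform policy. This is only meant to illustrate the trade-off between cost and state-space coverage that the ratio encodes; it does not reproduce any algorithm or experiment from the paper.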