Classical reinforcement learning (RL) aims to optimize the expected cumulative reward. In this work, we consider the RL setting where the goal is instead to optimize a quantile of the cumulative reward. We parameterize the policy with neural networks and propose a novel policy gradient algorithm, Quantile-Based Policy Optimization (QPO), together with its variant Quantile-Based Proximal Policy Optimization (QPPO), for solving deep RL problems with quantile objectives. QPO uses two coupled iterations running at different timescales to update the quantile estimate and the policy parameters simultaneously, whereas QPPO is an off-policy version of QPO that allows multiple parameter updates during one simulation episode, improving the efficiency of the algorithm. Our numerical results indicate that the proposed algorithms outperform existing baseline algorithms under the quantile criterion.
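The idea of two coupled iterations at different timescales can be illustrated with a minimal sketch. It is not the authors' implementation: the one-step toy problem, the Gaussian policy, the `reward` function, the name `quantile_policy_search`, and the step-size exponents are all assumptions chosen for illustration. The quantile estimate `q` follows a standard stochastic-approximation recursion with the faster step size, while the policy parameter `theta` moves with the slower step size along a likelihood-ratio estimate of the quantile gradient direction (the positive density factor is omitted, so only the direction is estimated).

```python
import random


def reward(action: float) -> float:
    """Toy one-step return, largest at action = 2 (an assumption for this demo)."""
    return -(action - 2.0) ** 2


def quantile_policy_search(alpha=0.25, n_iters=100_000, seed=0):
    """Two-timescale stochastic approximation on a one-step Gaussian policy.

    Policy (assumed): action ~ Normal(theta, 1). Returns the final (theta, q),
    where q tracks the alpha-quantile of the return under the current policy.
    """
    rng = random.Random(seed)
    theta, q = 0.0, 0.0
    for k in range(1, n_iters + 1):
        beta = k ** -0.6   # faster step size, used for the quantile estimate
        gamma = k ** -0.8  # slower step size, used for theta (gamma / beta -> 0)

        action = rng.gauss(theta, 1.0)
        ret = reward(action)
        below = 1.0 if ret <= q else 0.0  # indicator that the return is at or below q

        # Score function of Normal(theta, 1) with respect to theta.
        score = action - theta

        # Both updates use the values of q and theta from before this step.
        new_q = q + beta * (alpha - below)        # quantile tracking
        theta = theta + gamma * (-below * score)  # ascent direction for the quantile
        q = new_q
    return theta, q
```

Calling `theta, q = quantile_policy_search()` should drive `theta` toward 2, the action that maximizes every quantile of this toy return, while `q` settles near the 0.25-quantile of the return under the final policy. A deep RL version would replace the scalar `theta` with network weights and the one-step return with an episode's cumulative reward.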