This paper addresses a safe reinforcement learning (RL) problem with risk measure-based constraints. Because risk measures such as conditional value at risk (CVaR) focus on the tail of the cost distribution, constraining them can effectively prevent worst-case failures. TRC, an on-policy safe RL method, handles the CVaR-constrained RL problem with a trust region method and can generate policies that achieve high returns with almost zero constraint violations. However, to perform well in complex environments and satisfy safety constraints quickly, RL methods must be sample efficient. To this end, we propose an off-policy safe RL method with CVaR constraints, called off-policy TRC. If off-policy data from replay buffers is used directly to train TRC, the estimation error caused by distributional shift degrades performance. To resolve this issue, we propose novel surrogate functions that reduce the effect of the distributional shift, and we introduce an adaptive trust-region constraint that keeps the policy from deviating far from the data in the replay buffers. The proposed method has been evaluated in simulation and real-world environments, where it satisfied the safety constraints within a few steps while achieving high returns, even in complex robotic tasks.
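To illustrate why a CVaR constraint is stricter than a constraint on the expected cost, the following minimal sketch estimates CVaR from sampled episode costs. It is not the estimator used by TRC or off-policy TRC. The function name `empirical_cvar`, the risk level `alpha = 0.9`, and the two cost samples are assumptions chosen for illustration, and CVaR is taken here as the mean of the costs at or above the empirical `alpha`-quantile.

```python
import numpy as np


def empirical_cvar(costs, alpha=0.9):
    """Mean of the costs at or above the empirical alpha-quantile (VaR).

    Illustrative estimator only; `alpha` is a hypothetical risk level.
    """
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)   # value at risk at level alpha
    tail = costs[costs >= var]        # worst-case tail of the cost sample
    return float(tail.mean())


# Two hypothetical cost samples with the same mean but different tails.
steady_costs = [1.0] * 10             # every episode incurs cost 1
spiky_costs = [0.0] * 9 + [10.0]      # one rare, severe failure

mean_steady = float(np.mean(steady_costs))
mean_spiky = float(np.mean(spiky_costs))
cvar_steady = empirical_cvar(steady_costs, alpha=0.9)
cvar_spiky = empirical_cvar(spiky_costs, alpha=0.9)

print(f"mean: steady={mean_steady:.2f}, spiky={mean_spiky:.2f}")
print(f"CVaR: steady={cvar_steady:.2f}, spiky={cvar_spiky:.2f}")
```

Both samples have the same expected cost, so a constraint on the mean cannot tell them apart, whereas the CVaR of the second sample is dominated by its single severe failure.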