Applying reinforcement learning (RL) to real-world applications requires agents to adhere to the safety guidelines of their respective domains. Safe RL handles such guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may still fail to meet the safety guidelines because of the estimation bias of distributional critics, and the importance sampling required by the trust region method can hinder performance because of its high variance. We therefore improve safety performance in two ways. First, we train distributional critics to have low estimation bias using proposed target distributions in which bias and variance can be traded off. Second, we propose novel surrogates for the trust region method, expressed with Q-functions using the reparameterization trick. Additionally, depending on the initial policy, there may be no policy within the trust region that satisfies the constraints. To handle this infeasibility issue, we propose a gradient integration method that is guaranteed to find a policy satisfying all constraints from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.
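To make the notion of a risk-averse constraint concrete, the following is a minimal sketch rather than the method proposed here. It assumes that a distributional cost critic outputs equally weighted quantile estimates of the cost return, and that the risk measure is the conditional value at risk (CVaR); the abstract specifies neither, and the names `cvar_upper`, `satisfies_constraint`, `alpha`, and `threshold` are hypothetical.

```python
import math


def cvar_upper(quantiles, alpha):
    """Mean of the worst (largest) alpha-fraction of the quantile estimates.

    Assumption: `quantiles` are equally weighted estimates of the cost return,
    as a quantile-based distributional critic might produce.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not quantiles:
        raise ValueError("quantiles must be non-empty")
    ordered = sorted(quantiles, reverse=True)      # largest costs first
    k = max(1, math.ceil(alpha * len(ordered)))    # size of the upper tail
    tail = ordered[:k]
    return sum(tail) / len(tail)


def satisfies_constraint(quantiles, alpha, threshold):
    """Check the constraint CVaR_alpha(cost return) <= threshold."""
    return cvar_upper(quantiles, alpha) <= threshold


# Hypothetical critic output for one state-action pair.
cost_quantiles = [1.0, 2.0, 3.0, 10.0]

# alpha = 1.0 recovers the ordinary expectation (risk-neutral constraint).
risk_neutral_ok = satisfies_constraint(cost_quantiles, alpha=1.0, threshold=5.0)

# alpha = 0.5 averages only the worst half of the outcomes (risk-averse).
risk_averse_ok = satisfies_constraint(cost_quantiles, alpha=0.5, threshold=5.0)
```

In this toy case the same cost distribution passes the expectation-based constraint but fails the risk-averse one, which illustrates why a policy trained under risk-averse constraints would be expected to violate the safety threshold less often.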