Reinforcement Learning (RL) has been widely applied to control tasks and has substantially improved performance over conventional control methods in many domains where the reward function is well defined. However, for many real-world problems, it is often more convenient to formulate the optimization problem in terms of rewards and constraints simultaneously. Optimizing such constrained problems via reward shaping can be difficult, as it requires tedious manual tuning of reward functions with several interacting terms. Recent formulations that include constraints mostly require a pre-training phase, which often needs human expertise to collect data or assumes that a sub-optimal policy is readily available. We propose a new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function), which achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic. It implements an adaptive penalty for policy learning and alleviates the numerical issues that are known to complicate the application of the log barrier function method. We show that CSAC-LB achieves state-of-the-art performance on several constrained control tasks with different levels of difficulty, and we evaluate our method on a locomotion task on a real quadruped robot platform.
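To illustrate the kind of barrier the abstract refers to, the following is a minimal sketch of a log barrier with a linear extension. The abstract does not give the formula, so the form below is an assumption: it is one commonly used construction in which the barrier for a constraint z <= 0 follows -log(-z)/t in the safely feasible region and continues as a straight line beyond a threshold, so the penalty stays finite when the constraint is violated. The function name `smoothed_log_barrier` and the parameter `t` are hypothetical and may not match the paper's notation.

```python
import math


def smoothed_log_barrier(z: float, t: float = 5.0) -> float:
    """Penalty for a constraint of the form z <= 0 (assumed form, not the paper's).

    For z <= -1/t**2 this is the usual log barrier -log(-z)/t.
    For larger z it is the tangent line at z = -1/t**2, so the function
    is continuous, has a continuous slope, and is finite for z >= 0.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    threshold = -1.0 / t**2
    if z <= threshold:
        # Standard log barrier, defined only for strictly negative z.
        return -math.log(-z) / t
    # Linear extension: same value and slope (equal to t) at the threshold.
    return t * z - math.log(1.0 / t**2) / t + 1.0 / t


if __name__ == "__main__":
    # z could stand for (estimated cost - cost budget); values are illustrative.
    for z in (-1.0, -0.04, 0.0, 0.5):
        print(f"z = {z:+.2f}  penalty = {smoothed_log_barrier(z):.4f}")
```

In a constrained actor-critic setting, `z` could be the safety critic's cost estimate minus the allowed budget, with the resulting penalty added to the policy loss. Because the slope of the penalty grows as the estimate approaches the budget and becomes constant past the threshold, the penalty is adaptive while avoiding the undefined logarithm at z >= 0.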