Compositionality is a critical aspect of scalable system design. Reinforcement learning (RL) has shown substantial success in task learning, but has only recently begun to leverage composition. In this paper, we focus on Boolean composition of learned tasks, as opposed to functional or sequential composition. Existing work on Boolean composition for RL focuses on reaching a satisfying absorbing state in environments with discrete action spaces, and does not support composable safety (i.e., avoidance) constraints. We advance the state of the art in Boolean composition of learned tasks with three contributions: i) we introduce two distinct notions of safety in this framework; ii) we show how to enforce either safety semantics, prove correctness (under some assumptions), and analyze the trade-offs between the two safety notions; and iii) we extend Boolean composition from discrete action spaces to continuous action spaces. We demonstrate these techniques using modified versions of value iteration in a grid world, Deep Q-Network (DQN) in a grid world with image observations, and Twin Delayed DDPG (TD3) in a Bullet physics environment with continuous observations and continuous actions. We believe that these contributions advance the theory of safe reinforcement learning by allowing zero-shot composition of policies that satisfy safety properties.