This paper studies safe reinforcement learning (safe RL) with linear function approximation under hard instantaneous constraints, where unsafe actions must be avoided at every step. Existing studies of safe RL with hard instantaneous constraints rely on two key assumptions: $(i)$ the RL agent knows a safe action set for {\it every} state, or knows a {\it safe graph} in which all state-action-state triples are safe, and $(ii)$ the constraint/cost functions are {\it linear}. In this paper, we consider safe RL with hard instantaneous constraints without assumption $(i)$, and we generalize $(ii)$ to cost functions in a Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves $\tilde{\cO}(\sqrt{d^3H^4K})$ regret and $\tilde{\cO}(H \sqrt{dK})$ hard constraint violation when the cost function is linear, and $\cO(H\gamma_K \sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS. Here $K$ is the learning horizon, $H$ is the length of each episode, and $\gamma_K$ is the information gain with respect to the kernel used to approximate the cost functions. Our results achieve the optimal dependency on the learning horizon $K$, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, our approach is designed to encourage aggressive policy exploration, which offers a distinct perspective on safe RL with general cost functions and no prior knowledge of safe actions and may be of independent interest.
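To make the scaling in $K$ concrete, the stated rates can be evaluated numerically. The following is only a minimal sketch under loud assumptions: the function names are hypothetical, $d$ is taken to be the feature dimension, and all constants and logarithmic factors hidden by $\tilde{\cO}(\cdot)$ and $\cO(\cdot)$ are set to 1 or dropped. The outputs are therefore growth rates, not guarantees of LSVI-AE.

```python
import math


def regret_rate(d: int, H: int, K: int) -> float:
    """Growth rate sqrt(d^3 H^4 K); constants and log factors dropped (assumption)."""
    return math.sqrt(d**3 * H**4 * K)


def violation_rate_linear(d: int, H: int, K: int) -> float:
    """Growth rate H * sqrt(d K) for linear costs; constants and logs dropped (assumption)."""
    return H * math.sqrt(d * K)


def violation_rate_rkhs(H: int, K: int, gamma_K: float) -> float:
    """Growth rate H * gamma_K * sqrt(K) for RKHS costs.

    gamma_K depends on the kernel and on K; here the caller supplies it
    as a plain number (assumption, for illustration only).
    """
    return H * gamma_K * math.sqrt(K)


if __name__ == "__main__":
    d, H = 4, 10  # hypothetical problem sizes
    for K in (100, 10_000, 1_000_000):
        # Dividing by K shows the per-episode average shrinking like 1/sqrt(K).
        print(K, regret_rate(d, H, K) / K, violation_rate_linear(d, H, K) / K)
```

Because both linear-case rates scale as $\sqrt{K}$, quadrupling $K$ doubles them, while the per-episode averages decay as $1/\sqrt{K}$. In the RKHS case the dependence on $K$ also enters through $\gamma_K$, which is kernel-specific and is not modeled here.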