In many applications of Reinforcement Learning (RL), it is critically important that the algorithm performs safely: instantaneous hard constraints must be satisfied at each step, and unsafe states and actions must be avoided. However, existing algorithms for "safe" RL are often designed under formulations that either only require the expected cumulative cost to be bounded or assume that all states are safe. Such algorithms can therefore violate instantaneous hard constraints and traverse unsafe states (and actions) in practice. In this paper, we develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions under instantaneous hard constraints and the linear mixture model. The algorithm achieves a regret of $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$, where $d$ is the feature-mapping dimension, $K$ is the number of episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a safety-related parameter. This regret tightly matches the state-of-the-art regret in the setting with only unsafe actions and nearly matches that in the unconstrained setting, and the algorithm is also safe at each step. We also provide a lower bound of $\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependence on $\Delta_c$ is necessary. Further, both our algorithm design and regret analysis involve several novel ideas, which may be of independent interest.
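As a rough reading aid, the two stated bounds can be compared directly. The calculation below is only a sketch: it ignores the logarithmic factors and constants hidden in the $\tilde{O}$ and $\tilde{\Omega}$ notation, and the regime condition on $K$ is derived here from the stated bounds rather than taken from the paper. When $K \ge \frac{1}{d^2 \Delta_c^4}$, the term $dH\sqrt{K}$ dominates the maximum in the lower bound, and the ratio of the upper bound to the lower bound is

\[
\frac{d H^3 \sqrt{dK} / \Delta_c}{d H \sqrt{K}} \;=\; \frac{\sqrt{d}\, H^2}{\Delta_c}.
\]

Under these assumptions, both bounds scale as $\sqrt{K}$ in the number of episodes, and the remaining gap lies in the factors $\sqrt{d}$, $H^2$, and $1/\Delta_c$.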