Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore an unknown environment without any reward feedback. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model using the minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement affects the sample complexity needed to achieve the desired optimality of the policy obtained in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities of both algorithms match or even improve on the state of the art for their constraint-free counterparts up to constant factors, showing that the safety constraint hardly increases the sample complexity of RF-RL.