This paper studies safe reinforcement learning (RL) in partially observable environments, with the aim of achieving safe-reachability objectives. In a traditional partially observable Markov decision process (POMDP), ensuring safety typically requires estimating a belief over latent states. However, accurately estimating an optimal Bayesian filter that infers latent states from observations in a continuous state space is a significant challenge, largely because the likelihood is intractable. To address this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely under unknown system dynamics and partial observability. We leverage the Predictive State Representation (PSR) and the Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and we prove the associated results within this framework. Furthermore, we derive the essential operators from the kernel Bayes' rule, which enables future observations to be estimated recursively through these operators. Under the assumption of \textit{undercompleteness}, we establish a polynomial sample complexity for the RL algorithm with infinitely large observation and action spaces, which guarantees an $\epsilon$-suboptimal safe policy.
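As a point of reference, and not necessarily the exact operators derived in this paper, the standard empirical conditional mean embedding illustrates how an RKHS can represent a predictive distribution over future observations in closed form. The notation here is assumed for illustration only: $n$ samples $(x_i, y_i)$, where $x_i$ stands for a history-action feature and $y_i$ for the future observations that follow it, a kernel $k$ on the inputs, a kernel on the outputs with feature map $\psi$, and a regularization parameter $\lambda > 0$.

\[
\hat{\mu}_{Y \mid x} \;=\; \sum_{i=1}^{n} \beta_i(x)\, \psi(y_i),
\qquad
\beta(x) \;=\; (K + n\lambda I)^{-1} k_x ,
\]

where $K_{ij} = k(x_i, x_j)$ and $(k_x)_i = k(x_i, x)$. Under this sketch, the conditional expectation of any function $f$ in the output RKHS is estimated by $\sum_{i=1}^{n} \beta_i(x) f(y_i)$, so no explicit likelihood over latent states is required.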