Robust reinforcement learning (RL) aims to find a policy that optimizes worst-case performance under uncertainty. In this paper, we focus on action robust RL with probabilistic policy execution uncertainty: instead of always carrying out the action specified by the policy, the agent takes that action with probability $1-\rho$ and an alternative adversarial action with probability $\rho$. We establish the existence of an optimal policy for action robust MDPs with probabilistic policy execution uncertainty and provide the action robust Bellman optimality equation for its solution. Furthermore, we develop the Action Robust Reinforcement Learning with Certificates (ARRLC) algorithm, which achieves minimax optimal regret and sample complexity. Finally, we conduct numerical experiments to validate the robustness of our approach, demonstrating that ARRLC outperforms non-robust RL algorithms and converges faster than the robust TD algorithm in the presence of action perturbations.
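To make the execution model in the abstract concrete, the following is a minimal sketch of a finite-horizon, tabular backup in which the agent's chosen action is followed with probability $1-\rho$ and a worst-case action with probability $\rho$. It is not the ARRLC algorithm: it assumes the transition model and rewards are known, and it assumes the adversary picks the value-minimizing action, so the backup takes the form $(1-\rho)\max_a Q(s,a) + \rho\min_a Q(s,a)$. The function name and the tabular inputs `P` and `R` are hypothetical, chosen only for illustration.

```python
def action_robust_value_iteration(P, R, rho, horizon):
    """Finite-horizon backup under an assumed worst-case action adversary.

    P[s][a][s2] is a transition probability and R[s][a] a reward; both are
    hypothetical tabular inputs. Returns V[h][s] and a greedy policy[h][s].
    """
    n_states = len(R)
    n_actions = len(R[0])
    # V[horizon] is the terminal value, fixed at zero.
    V = [[0.0] * n_states for _ in range(horizon + 1)]
    policy = [[0] * n_states for _ in range(horizon)]
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            # One-step lookahead value of each action at step h.
            q = [
                R[s][a]
                + sum(P[s][a][s2] * V[h + 1][s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            # The agent's intended action is the greedy one.
            policy[h][s] = max(range(n_actions), key=lambda a: q[a])
            # With probability 1 - rho the intended action is executed;
            # with probability rho the (assumed) adversary's minimizer is.
            V[h][s] = (1 - rho) * max(q) + rho * min(q)
    return V, policy


# Toy example: one state, two actions with rewards 1 and 0.
P = [[[1.0], [1.0]]]
R = [[1.0, 0.0]]
V, policy = action_robust_value_iteration(P, R, rho=0.2, horizon=2)
print(V[0][0], policy[0][0])
```

Setting `rho=0` recovers the standard (non-robust) Bellman backup, which gives a quick sanity check on the sketch.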