We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but they become intractable as the number of states grows. To this end, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We then propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time guarantees that the proposed RNAC algorithm converges to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by RNAC in multiple MuJoCo environments and a real-world TurtleBot navigation task.