We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but they are no longer tractable when the number of states scales up. To address this limitation, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when only a simulator is available. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time guarantees that the proposed RNAC algorithm converges to the optimal robust policy up to the function approximation error. Finally, we demonstrate the robust performance of the policy learned by RNAC in multiple MuJoCo environments and a real-world TurtleBot navigation task.