Offline reinforcement learning (RL) faces the significant challenge of distribution shift. Model-free offline RL tackles this problem by penalizing the Q value for out-of-distribution (OOD) data or by constraining the policy to stay close to the behavior policy, but this inhibits exploration of the OOD region. Model-based offline RL, which uses a trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective method for this problem. However, current model-based algorithms rarely consider agent robustness when incorporating conservatism into the policy. Therefore, a new model-based offline algorithm with a conservative Bellman operator (MICRO) is proposed. This method trades off performance and robustness by introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO can significantly reduce the computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms on offline RL benchmarks and is considerably robust to adversarial perturbations.
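The idea of choosing the minimal Q value in a state uncertainty set can be illustrated with a minimal sketch. It is not the MICRO implementation: the abstract does not specify how the uncertainty set is defined or searched, so the sketch assumes an L-infinity ball of radius `epsilon` around the model-predicted next state and approximates the minimum by random sampling. The names `q_fn`, `policy_fn`, `epsilon`, and `num_samples` are hypothetical.

```python
import numpy as np


def min_q_in_uncertainty_set(q_fn, policy_fn, next_state, epsilon,
                             num_samples=10, rng=None):
    """Approximate the minimum of Q(s, pi(s)) over states s near next_state.

    Assumption: the uncertainty set is an L-infinity ball of radius epsilon,
    searched by uniform sampling rather than by an adversarial model.
    """
    rng = np.random.default_rng() if rng is None else rng
    next_state = np.asarray(next_state, dtype=float)
    # Sample perturbations inside the ball; keep the nominal state as a candidate.
    noise = rng.uniform(-epsilon, epsilon, size=(num_samples, next_state.shape[0]))
    candidates = np.vstack([next_state[None, :], next_state[None, :] + noise])
    q_values = np.array([q_fn(s, policy_fn(s)) for s in candidates])
    return float(q_values.min())


def robust_target(reward, gamma, q_fn, policy_fn, next_state, epsilon,
                  num_samples=10, rng=None):
    """One-step target that bootstraps from the worst-case sampled Q value."""
    worst_q = min_q_in_uncertainty_set(
        q_fn, policy_fn, next_state, epsilon, num_samples, rng
    )
    return reward + gamma * worst_q
```

Because the nominal next state is always among the candidates, the sampled minimum never exceeds the standard bootstrap value, and with `epsilon = 0` the target reduces to the usual one-step Bellman target. Only forward passes of the Q function are needed, which reflects the abstract's point that no separately trained adversarial model is required.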