Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift: the learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees, showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines, including percentile-based methods, on the majority of environments while remaining competitive on the rest. Moreover, RRPI assigns lower $Q$-values to state-action pairs with high epistemic uncertainty, preventing the policy from executing unreliable out-of-distribution actions.
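For concreteness, one natural instance of the kind of KL-regularized robust Bellman operator the abstract refers to is sketched below; the notation here ($\beta$ for the penalty weight, $\hat{P}_0$ for the nominal learned dynamics, $\mathcal{T}^{\pi}_{\beta}$ for the operator) is illustrative and need not match the paper's exact definition. The hard worst-case minimization over an uncertainty set is relaxed into a KL penalty toward the nominal model,
\[
(\mathcal{T}^{\pi}_{\beta} Q)(s,a) \;=\; r(s,a) \;+\; \gamma \,\min_{P(\cdot\mid s,a)} \Big\{ \mathbb{E}_{s' \sim P(\cdot\mid s,a),\, a' \sim \pi(\cdot\mid s')}\big[Q(s',a')\big] \;+\; \beta\, D_{\mathrm{KL}}\!\big(P(\cdot\mid s,a)\,\big\|\,\hat{P}_0(\cdot\mid s,a)\big) \Big\},
\]
and the inner minimization admits the standard Gibbs-variational (log-sum-exp) closed form
\[
(\mathcal{T}^{\pi}_{\beta} Q)(s,a) \;=\; r(s,a) \;-\; \gamma\,\beta \log \mathbb{E}_{s' \sim \hat{P}_0(\cdot\mid s,a)}\Big[\exp\!\Big(-\tfrac{1}{\beta}\,\mathbb{E}_{a' \sim \pi(\cdot\mid s')}\big[Q(s',a')\big]\Big)\Big],
\]
so worst-case dynamics can be evaluated by reweighting samples from the learned model rather than by solving a bilevel problem; since the soft-min is non-expansive, such an operator remains a $\gamma$-contraction in the sup norm.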