Robust Markov decision processes (MDPs) address environment shift through distributionally robust optimization (DRO) by finding a policy that is optimal against the worst case within an uncertainty set of transition kernels. However, standard DRO approaches must enlarge the uncertainty set under large shifts, which leads to overly conservative, pessimistic policies. In this paper, we propose a framework for transfer under environment shift that derives a robust target-domain policy via estimate-centered uncertainty sets, constructed through constrained estimation that integrates limited target samples with side information about the source-target dynamics. The side information includes bounds on feature moments, distributional distances, and density ratios, yielding improved kernel estimates and tighter uncertainty sets. We establish error bounds and convergence results for both robust and non-robust value functions. Moreover, we provide a finite-sample guarantee on the learned robust policy and analyze the robust sub-optimality gap; under mild low-dimensional structure on the transition model, the side information reduces this gap and improves sample efficiency. We assess the performance of our approach across OpenAI Gym environments and classic control problems, consistently demonstrating superior target-domain performance over state-of-the-art robust and non-robust baselines.
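As a brief illustrative sketch (the notation here is ours, not taken from the paper), the estimate-centered robust objective described above can be written as

\[
\hat{\pi} \in \arg\max_{\pi} \; \min_{P \in \mathcal{U}(\hat{P}_{\mathrm{tgt}})} \; \mathbb{E}_{P,\pi}\!\left[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\mathcal{U}(\hat{P}_{\mathrm{tgt}}) = \left\{ P : d\big(P(\cdot \mid s,a),\, \hat{P}_{\mathrm{tgt}}(\cdot \mid s,a)\big) \le \rho \;\; \forall (s,a) \right\},
\]

where \(\hat{P}_{\mathrm{tgt}}\) denotes the constrained estimate of the target transition kernel built from limited target samples and the side information, \(d\) is a distributional distance, and \(\rho\) is the uncertainty radius. The intuition is that a more accurate center \(\hat{P}_{\mathrm{tgt}}\) permits a smaller \(\rho\), and hence a less conservative robust policy.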