Motivated by the need for policies that remain robust to environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). We do so through a comprehensive modeling framework centered on distributionally robust Markov decision processes (DRMDPs), in which the decision maker must choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct DRMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the granularity of adaptability, covering history-dependent, Markov, and Markov time-homogeneous dynamics for both the decision maker and the adversary. We also examine the flexibility of the shifts the adversary can induce, considering SA- and S-rectangularity. Within this DRMDP framework, we investigate conditions under which the dynamic programming principle (DPP) holds or fails. From an algorithmic standpoint, the existence of a DPP has significant implications, as the vast majority of existing data- and computation-efficient RL algorithms rely on it. To study its existence, we systematically examine combinations of controller and adversary attributes, providing streamlined proofs grounded in a unified methodology. We also offer counterexamples for settings in which a DPP with full generality is absent.
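For orientation, the kind of statement a DPP provides can be sketched with the standard robust Bellman equations. The notation here is assumed for illustration and is not taken from the abstract: a finite state space $S$ and action space $A$, a reward $r(s,a)$, a discount factor $\gamma \in (0,1)$, and adversarial uncertainty sets of transition distributions. Under SA-rectangularity the adversary chooses a distribution from a set $\mathcal{P}_{s,a}$ independently for each state-action pair, and the robust value function is commonly written as the fixed point

\[
v^*(s) = \max_{a \in A} \; \inf_{p \in \mathcal{P}_{s,a}} \Big[ r(s,a) + \gamma \sum_{s' \in S} p(s')\, v^*(s') \Big].
\]

Under S-rectangularity the adversary chooses a whole family $(p_{s,a})_{a \in A}$ from a set $\mathcal{P}_s$ for each state, which couples the actions, and the decision maker is typically allowed a randomized action $d \in \Delta(A)$:

\[
v^*(s) = \sup_{d \in \Delta(A)} \; \inf_{(p_{s,a})_{a \in A} \in \mathcal{P}_s} \; \sum_{a \in A} d(a) \Big[ r(s,a) + \gamma \sum_{s' \in S} p_{s,a}(s')\, v^*(s') \Big].
\]

These are textbook forms under the assumptions stated above, not the paper's own results; whether such a recursion actually characterizes the optimal robust value for a given combination of controller and adversary attributes is the question the DPP analysis addresses.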