Robust Markov Decision Processes (RMDPs) have received significant research interest as an alternative to standard Markov Decision Processes (MDPs), which typically assume fixed transition probabilities. RMDPs relax this assumption by optimizing for the worst case over an ambiguity set of transition probabilities. Earlier studies on RMDPs have largely centered on risk-neutral reinforcement learning (RL), where the goal is to minimize the expected total discounted cost. In this paper, we analyze the robustness of risk-sensitive RL based on Conditional Value-at-Risk (CVaR) under RMDPs. First, we consider predetermined ambiguity sets. Using the coherence of CVaR, we establish a connection between robustness and risk sensitivity, so that techniques from risk-sensitive RL can be adopted to solve the proposed problem. Furthermore, motivated by the existence of decision-dependent uncertainty in real-world problems, we study problems with state-action-dependent ambiguity sets. To solve these problems, we define a new risk measure named NCVaR and establish the equivalence between NCVaR optimization and robust CVaR optimization. We further propose value iteration algorithms and validate our approach in simulation experiments.
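The link between robustness and risk sensitivity mentioned in the abstract rests on a standard property of coherent risk measures: CVaR can be written as a worst-case expectation over a set of reweighted distributions. The sketch below illustrates this for a discrete cost distribution by computing CVaR in two ways, once with the Rockafellar-Uryasev minimization formula and once with the dual (worst-case reweighting) form. It is only an illustration of that general property, not the paper's algorithm. The function names `cvar_primal` and `cvar_dual`, the example numbers, and the convention that `alpha` is the mass of the upper cost tail are all assumptions made here; the paper may use a different convention.

```python
import numpy as np


def cvar_primal(costs, probs, alpha):
    """CVaR via the Rockafellar-Uryasev form: min_t t + E[(Z - t)^+] / alpha.

    Assumes a discrete cost distribution and alpha in (0, 1]. For a discrete
    distribution the objective is piecewise linear in t, so a minimizer can
    be found among the atoms of the distribution.
    """
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return min(
        t + np.dot(probs, np.maximum(costs - t, 0.0)) / alpha for t in costs
    )


def cvar_dual(costs, probs, alpha):
    """CVaR via its dual form: the worst-case expectation over reweightings.

    Maximizes sum_i q_i * z_i subject to 0 <= q_i <= p_i / alpha and
    sum_i q_i = 1. A greedy pass that pushes as much mass as allowed onto
    the largest costs solves this small linear program.
    """
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    remaining = 1.0  # probability mass still to be allocated
    total = 0.0
    for i in np.argsort(-costs):  # visit costs from largest to smallest
        weight = min(probs[i] / alpha, remaining)
        total += weight * costs[i]
        remaining -= weight
        if remaining <= 0.0:
            break
    return total


# Hypothetical three-outcome cost distribution, chosen only for illustration.
example_costs = [0.0, 1.0, 10.0]
example_probs = [0.5, 0.4, 0.1]
primal_value = cvar_primal(example_costs, example_probs, alpha=0.2)
dual_value = cvar_dual(example_costs, example_probs, alpha=0.2)
```

On this example both functions return the same value, which is the point: evaluating a risk-sensitive objective (the primal form) coincides with evaluating an expectation under an adversarially chosen distribution (the dual form). With `alpha=1.0` the reweighting constraint forces the original distribution, and both forms reduce to the ordinary expected cost.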