Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach to handling discrepancies between training and testing environments. To balance robustness, conservatism, and computational tractability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. Most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies. S-rectangular models, however, more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-4}\varepsilon^{-2})$, where $\varepsilon$ is the target accuracy, $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state and action spaces, and $\gamma$ is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $\varepsilon$ simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, which demonstrate the fast learning performance of the proposed algorithm.
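To make the scaling in the stated bound concrete, the following minimal sketch evaluates only its leading-order term, $|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-4}\varepsilon^{-2}$. The function name and the example problem sizes are hypothetical, and the constants and logarithmic factors hidden by $\widetilde{O}(\cdot)$ are ignored, so the absolute numbers carry no meaning; only the ratios between settings are informative.

```python
def leading_order_samples(num_states, num_actions, gamma, epsilon):
    """Leading-order term |S||A|(1-gamma)^-4 eps^-2 of the stated bound.

    Hypothetical helper for illustration only: constants and log factors
    hidden by the O-tilde notation are dropped.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    return num_states * num_actions / ((1.0 - gamma) ** 4 * epsilon ** 2)


# Illustrative (made-up) problem sizes.
base = leading_order_samples(10, 5, gamma=0.9, epsilon=0.1)

# Halving the target accuracy epsilon multiplies the term by about 4.
eps_ratio = leading_order_samples(10, 5, gamma=0.9, epsilon=0.05) / base

# Halving (1 - gamma), i.e. doubling the effective horizon, multiplies it by about 16.
horizon_ratio = leading_order_samples(10, 5, gamma=0.95, epsilon=0.1) / base

# Doubling the number of states doubles the term (linear dependence).
state_ratio = leading_order_samples(20, 5, gamma=0.9, epsilon=0.1) / base

print(eps_ratio, horizon_ratio, state_ratio)
```

The ratios show the linear dependence on $|\mathcal{S}|$ and $|\mathcal{A}|$, the quadratic dependence on $1/\varepsilon$, and the quartic dependence on the effective horizon $1/(1-\gamma)$.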