We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or a robust dynamic programming principle. Under the weakly communicating assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into bounds on expected regret. We then introduce a transient-value framework to evaluate the finite-time performance of robust optimal policies, proving that average-reward optimality alone can mask arbitrarily poor transient behavior and deriving regret-based lower bounds on transient values. Finally, we construct an epoch-based policy that combines an optimal stationary policy for the worst-case model with an anytime-valid sequential test and an online-learning fallback, achieving a constant-order transient value.
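
For concreteness, the minimax representation stated above can be written as follows. This is an illustrative rendering in assumed notation (not taken verbatim from the paper): $\mathcal{P}$ denotes the ambiguity set of stationary kernels, $g^{\pi}(P)$ the long-run average reward (gain) of policy $\pi$ under kernel $P$, $g^{\star}(P) = \sup_{\pi} g^{\pi}(P)$ the classical optimal gain of model $P$, and $\Pi_{\mathrm{H}}$ the class of history-dependent policies:
\[
\sup_{\pi \in \Pi_{\mathrm{H}}} \, \inf_{P \in \mathcal{P}} \, g^{\pi}(P)
\;=\; \inf_{P \in \mathcal{P}} \, g^{\star}(P)
\;=\; \inf_{P \in \mathcal{P}} \, \sup_{\pi} \, g^{\pi}(P).
\]
The content of the result is that the outer supremum and infimum exchange even though $\mathcal{P}$ is non-rectangular, so no robust dynamic programming principle is needed to certify the robust value.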
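
The epoch-based construction in the final sentence can be sketched in code. The following is a minimal, hypothetical sketch, not the paper's algorithm: ToyEnv, AnytimeValidTest, FallbackLearner, and epoch_policy are all illustrative stand-ins, the test statistic is a placeholder, and a real instantiation would use an e-process or confidence-sequence test together with, say, a UCRL-style average-reward learner.

```python
# Hypothetical sketch of the epoch-based policy: play the worst-case-optimal
# stationary policy while an anytime-valid sequential test does not reject
# the worst-case model; on rejection, switch permanently to an online learner.
import random


class ToyEnv:
    """Two-state toy MDP standing in for the true (unknown) model."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # Action 1 tends to move to state 1, which pays reward 1.
        self.s = 1 if random.random() < (0.8 if a == 1 else 0.2) else 0
        return self.s, float(self.s)


class AnytimeValidTest:
    """Stub anytime-valid sequential test against the worst-case model.

    A real construction would track an e-process (or confidence sequence)
    over observed transitions; uniform-in-time validity is what makes
    switching at the random rejection time statistically sound."""
    def __init__(self, threshold=20.0):
        self.evidence, self.threshold = 0.0, threshold

    def update(self, s, a, s_next):
        self.evidence += 0.1  # placeholder for a log-likelihood-ratio term
        return self.evidence >= self.threshold


class FallbackLearner:
    """Stub online learner (e.g. a UCRL-style average-reward algorithm)."""
    def act(self, s):
        return random.randint(0, 1)

    def update(self, s, a, r, s_next):
        pass  # a real learner would update its model and confidence sets here


def epoch_policy(pi_worst, test, learner, env, epochs=6, base_len=50):
    """Run pi_worst in epochs of doubling length until the test rejects,
    then fall back permanently to the online learner."""
    s, rejected, total = env.reset(), False, 0.0
    for k in range(epochs):
        for _ in range(base_len * 2 ** k):
            a = learner.act(s) if rejected else pi_worst[s]
            s_next, r = env.step(a)
            if rejected:
                learner.update(s, a, r, s_next)
            elif test.update(s, a, s_next):
                rejected = True  # switch at most once, at the rejection time
            s, total = s_next, total + r
    return total


if __name__ == "__main__":
    gain = epoch_policy({0: 1, 1: 1}, AnytimeValidTest(), FallbackLearner(),
                        ToyEnv())
    print(f"cumulative reward: {gain:.1f}")
```

The doubling epoch lengths and the uniform-in-time validity of the test are what would allow such a scheme to pay only a constant-order transient penalty: the switch happens at most once, and each component is near-optimal on its own segment of the trajectory.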