Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs best under the worst-case environment within an uncertainty set, using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL has been limited to the planning problem, where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm featuring a novel dataset distillation step and a lower confidence bound (LCB) design for the robust values under different types of uncertainty sets. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy using $O(1/\epsilon^2)$ offline samples. Moreover, we extend our algorithm to the case where the nominal model does not have a specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all the types of uncertainty sets considered.
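As a point of reference for the dual forms mentioned above, the following sketch recalls the classical strong-duality result in the Markovian case with a Kullback-Leibler (KL) uncertainty set; it is intended only for intuition and is not the non-Markovian dual derived in this paper. The notation is standard but assumed here: $P^0$ denotes the nominal transition kernel, $\rho$ the KL radius, $V$ the value over next states, and $\lambda$ the dual variable.
$$
\inf_{P:\, D_{\mathrm{KL}}(P \,\|\, P^0) \le \rho}\; \mathbb{E}_{s' \sim P}\big[V(s')\big]
\;=\; \sup_{\lambda \ge 0}\, \Big\{ -\lambda \log \mathbb{E}_{s' \sim P^0}\big[e^{-V(s')/\lambda}\big] \;-\; \lambda\rho \Big\}.
$$
Dual representations of this kind replace an intractable minimization over transition kernels with a scalar optimization over $\lambda$, which can be estimated from samples drawn under the nominal model; the dual forms in this paper play the analogous role for trajectory-level robust values in the non-Markovian setting.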