Offline reinforcement learning aims to find the optimal policy from a pre-collected dataset without active exploration. The problem poses major challenges, including limited data and distribution shift. Existing studies apply the principle of pessimism in the face of uncertainty and penalize the rewards of less-visited state-action pairs. In this paper, we directly model the uncertainty in the transition kernel using an uncertainty set, and then apply distributionally robust optimization, which optimizes the worst-case performance over that set. We first design a Hoeffding-style uncertainty set, which contains the true transition kernel with high probability. We prove that it achieves $\epsilon$-accuracy with a sample complexity of $\mathcal{O}\left((1-\gamma)^{-4}\epsilon^{-2}SC^{\pi^*} \right)$, where $\gamma$ is the discount factor, $C^{\pi^*}$ is the single-policy concentrability for any comparator policy $\pi^*$, and $S$ is the number of states. We further design a Bernstein-style uncertainty set, which does not necessarily contain the true transition kernel. For this set, we show an improved and near-optimal sample complexity of $\mathcal{O}\left((1-\gamma)^{-3}\epsilon^{-2}\left(SC^{\pi^*}+(\mu_{\min})^{-1}\right) \right)$, where $\mu_{\min}$ denotes the minimal non-zero entry of the behavior distribution. In addition, the computational complexity of our algorithms is the same as that of LCB-based methods in the literature. Our results demonstrate that distributionally robust optimization can also solve offline reinforcement learning efficiently.
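To make the idea of optimizing worst-case performance over an uncertainty set concrete, the following is a minimal sketch of robust value iteration in a tabular setting. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the form of the uncertainty set, so the sketch assumes a total-variation ball around the empirical transition kernel with a hypothetical Hoeffding-like radius $\sqrt{\log(SA/\delta)/(2N(s,a))}$. The names `worst_case_expectation`, `robust_value_iteration`, `counts`, and `delta` are all introduced here for illustration.

```python
import numpy as np


def worst_case_expectation(p, v, radius):
    """Minimize q @ v over distributions q with total-variation distance
    at most `radius` from p (assumed set shape, not taken from the paper)."""
    q = p.astype(float).copy()
    lo = int(np.argmin(v))
    # Mass that can be moved onto the lowest-value state.
    budget = min(radius, 1.0 - q[lo])
    q[lo] += budget
    # Remove the same amount of mass from the highest-value states first.
    for s in np.argsort(-v):
        if budget <= 0.0:
            break
        if s == lo:
            continue
        take = min(q[s], budget)
        q[s] -= take
        budget -= take
    return float(q @ v)


def robust_value_iteration(counts, rewards, gamma, delta=0.1, iters=500):
    """counts[s, a, s'] are transition counts from the offline dataset;
    rewards[s, a] are assumed known and to lie in [0, 1]."""
    S, A, _ = counts.shape
    n = counts.sum(axis=2)
    visited = n > 0

    # Empirical kernel; unvisited pairs get a placeholder uniform row.
    p_hat = np.full(counts.shape, 1.0 / S)
    p_hat[visited] = counts[visited] / n[visited][:, None]

    # Hypothetical Hoeffding-like radius; unvisited pairs get the maximal
    # radius, so their worst case is the lowest-value next state.
    radius = np.ones((S, A))
    radius[visited] = np.minimum(
        1.0, np.sqrt(np.log(S * A / delta) / (2.0 * n[visited]))
    )

    v = np.zeros(S)
    q_values = np.zeros((S, A))
    for _ in range(iters):
        for s in range(S):
            for a in range(A):
                q_values[s, a] = rewards[s, a] + gamma * worst_case_expectation(
                    p_hat[s, a], v, radius[s, a]
                )
        v = q_values.max(axis=1)
    return v, q_values.argmax(axis=1)
```

In this sketch, state-action pairs with few samples get a larger radius and therefore a lower worst-case value, which plays the same role as the reward penalty in pessimism-based methods. Scaling all counts up while keeping the empirical frequencies fixed shrinks the radius, so the robust values can only increase toward the values of the empirical model.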