We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$-state-action pairs, bounded rewards, and discount factor $\gamma$. We provide an $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in $\tilde{O}(1)$-time, and an $\tilde{O}(s + (1-\gamma)^{-2})$-time algorithm in the offline setting, where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state of the art, which either ran in $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, ran in $\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or required time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]: we provide a variant that carefully truncates the progress of its iterates to reduce the variance of the new variance-reduced sampling procedures we introduce to implement its steps. Our method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$-space when given generative model access. Consequently, our results take a step toward closing the sample-complexity gap between model-free and model-based methods.
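The core primitive the abstract builds on, epoch-based variance-reduced value iteration with a generative model, can be illustrated with a toy sketch. This is not the paper's algorithm: the MDP, the sample sizes (`big`, `small`), the epoch/inner-loop counts, and the plain (untruncated) update are all illustrative assumptions, and the paper's careful truncation of iterate progress is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discounted MDP: n states, m actions, rewards in [0, 1].
n, m, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(n, m))  # P[s, a] = distribution over next states
R = rng.random((n, m))

def sample_next(s, a, k):
    """Generative model: k i.i.d. next-state samples for the pair (s, a)."""
    return rng.choice(n, size=k, p=P[s, a])

def vr_value_iteration(epochs=25, inner=20, big=4000, small=64):
    """Epoch-based variance-reduced value iteration (illustrative sample sizes)."""
    v = np.zeros(n)
    for _ in range(epochs):
        v0 = v.copy()
        # Once per epoch: an expensive, accurate estimate of (P v0)(s, a).
        anchor = np.array([[v0[sample_next(s, a, big)].mean()
                            for a in range(m)] for s in range(n)])
        for _ in range(inner):
            # Cheap inner steps: only the residual P(v - v0) is re-sampled, so the
            # estimator's variance shrinks as v approaches v0.  (The paper further
            # truncates the iterates' progress to control this residual's variance.)
            q = np.array([[R[s, a] + gamma * (anchor[s, a]
                           + (v - v0)[sample_next(s, a, small)].mean())
                           for a in range(m)] for s in range(n)])
            v = q.max(axis=1)
    return v

# Reference: exact value iteration with the known transition matrix.
v_star = np.zeros(n)
for _ in range(500):
    v_star = (R + gamma * P @ v_star).max(axis=1)

v_hat = vr_value_iteration()
```

The split between one expensive anchor estimate per epoch and many cheap residual estimates is what yields the per-pair sample savings the abstract describes; on this toy instance `v_hat` tracks `v_star` to within a fraction of $(1-\gamma)^{-1}$.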