We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes (MDPs). Existing approaches provide finite-time guarantees in the approximate setting ($\varepsilon>0$), but their high computational cost makes them hard to implement, and their sample complexity depends suboptimally on $\log(1/\delta)$. We propose a randomized, computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method is asymptotically optimal both in sample complexity and in posterior contraction rate, and it runs in $O(S^2AH)$ time per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid a suboptimal polynomial dependence on $\log(1/\delta)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.
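To make the $O(S^2AH)$ per-episode cost concrete, the sketch below shows a generic model-based step: draw one transition model from a posterior, then plan in it by backward induction. It is an illustration under stated assumptions, not the algorithm proposed here. The abstract does not define $S$, $A$, and $H$; we read them as the number of states, the number of actions, and the horizon. The Dirichlet posterior, the stationary transition kernel, and the function names `sample_transitions` and `backward_induction` are all hypothetical choices made for this example.

```python
import numpy as np


def sample_transitions(counts, rng, prior=1.0):
    """Draw one transition model from a Dirichlet posterior (assumed prior).

    counts: (S, A, S) array of observed transition counts.
    Returns P of shape (S, A, S), where each P[s, a] sums to 1.
    """
    S, A, _ = counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P


def backward_induction(P, R, H):
    """Finite-horizon planning in a tabular MDP.

    P: (S, A, S) transition probabilities, assumed stationary across steps.
    R: (S, A) mean rewards.
    Returns Q of shape (H, S, A) and the greedy policy of shape (H, S).
    """
    S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros(S)  # value after the final step is zero
    for h in reversed(range(H)):
        # One Bellman backup: S*A entries, each a length-S dot product,
        # so O(S^2 A) per step and O(S^2 A H) over the horizon.
        Q[h] = R + P @ V
        V = Q[h].max(axis=1)
    policy = Q.argmax(axis=2)
    return Q, policy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, H = 4, 2, 5
    counts = rng.integers(0, 10, size=(S, A, S))  # synthetic counts
    R = rng.random((S, A))                        # synthetic rewards
    P = sample_transitions(counts, rng)
    Q, policy = backward_induction(P, R, H)
    print(policy.shape)
```

Sampling the model touches each of the $S^2A$ transition entries once and planning performs $H$ backups of cost $O(S^2A)$ each, so the step as a whole stays within $O(S^2AH)$.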