This paper investigates whether quantum acceleration can improve regret in infinite-horizon, average-reward Markov Decision Processes (MDPs). We introduce a quantum framework for the agent's interaction with an unknown MDP that extends the conventional interaction paradigm. Within this framework, we design an optimism-driven tabular reinforcement learning algorithm that exploits the quantum signals acquired by the agent, using efficient quantum mean estimation techniques. Our theoretical analysis shows that the quantum advantage in mean estimation yields an exponential improvement in the regret guarantee for infinite-horizon reinforcement learning. Specifically, the proposed quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$, compared with the $\tilde{\mathcal{O}}(\sqrt{T})$ bound of classical counterparts, where $T$ is the number of interaction steps and $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in $T$.
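To give a rough intuition for why a faster mean estimator can change the regret rate, the following minimal sketch sums an illustrative confidence radius over the number of samples. It is not the algorithm analyzed in the paper. It assumes, with all constants and logarithmic factors dropped, that a classical estimator has error on the order of $1/\sqrt{n}$ after $n$ samples, that a quantum mean estimator has error on the order of $1/n$, and that accumulated regret in an optimism-driven method behaves like the sum of these radii. The function names `classical_radius`, `quantum_radius`, and `cumulative_width` are hypothetical and chosen only for this illustration.

```python
import math


def classical_radius(n: int) -> float:
    """Illustrative classical confidence radius after n samples (assumed rate 1/sqrt(n))."""
    return 1.0 / math.sqrt(n)


def quantum_radius(n: int) -> float:
    """Illustrative quantum mean-estimation radius after n samples (assumed rate 1/n)."""
    return 1.0 / n


def cumulative_width(radius, horizon: int) -> float:
    """Sum the radius over n = 1..horizon, a crude proxy for accumulated regret."""
    return sum(radius(n) for n in range(1, horizon + 1))


if __name__ == "__main__":
    # Compare how the two proxies grow as the horizon T increases.
    for horizon in (10**2, 10**3, 10**4, 10**5):
        c = cumulative_width(classical_radius, horizon)
        q = cumulative_width(quantum_radius, horizon)
        print(f"T={horizon:>7}  classical proxy={c:9.2f}  quantum proxy={q:6.2f}")
```

Under these assumptions the classical proxy grows like $2\sqrt{T}$, while the quantum proxy grows like $\ln T$, which is the kind of logarithmic factor that the $\tilde{\mathcal{O}}(1)$ notation absorbs.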