This paper investigates the potential of quantum acceleration for infinite-horizon Markov Decision Processes (MDPs), with the goal of improving average-reward performance. We introduce a quantum framework for the agent's interaction with an unknown MDP that extends the conventional interaction paradigm. Within this framework, we design an optimism-driven tabular Reinforcement Learning algorithm that exploits the quantum signals acquired by the agent through efficient quantum mean estimation techniques. Our theoretical analysis shows that the quantum advantage in mean estimation yields an exponential improvement in regret guarantees for infinite-horizon Reinforcement Learning. Specifically, the proposed quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$, compared with the $\tilde{\mathcal{O}}(\sqrt{T})$ bound of its classical counterparts.
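As a rough intuition for the gap between the two regret bounds, the sketch below compares how per-sample estimation errors accumulate. It is a heuristic illustration, not the analysis of the paper: it assumes a classical mean estimate from $n$ samples has error on the order of $1/\sqrt{n}$, that a quantum mean estimate has error on the order of $1/n$, and that regret scales with the sum of these errors over $T$ steps. The function name `cumulative_width` and the chosen values of `T` are hypothetical.

```python
import math


def cumulative_width(num_steps: int, exponent: float) -> float:
    """Sum of assumed per-sample error widths n**(-exponent) for n = 1..num_steps.

    exponent = 0.5 models the classical 1/sqrt(n) rate (assumption);
    exponent = 1.0 models the quantum 1/n rate (assumption).
    """
    return sum(n ** (-exponent) for n in range(1, num_steps + 1))


if __name__ == "__main__":
    for T in (10**2, 10**4, 10**6):
        classical = cumulative_width(T, 0.5)  # grows roughly like 2 * sqrt(T)
        quantum = cumulative_width(T, 1.0)    # grows roughly like ln(T)
        print(
            f"T={T:>8}  classical sum={classical:10.2f} "
            f"(2*sqrt(T)={2 * math.sqrt(T):10.2f})  "
            f"quantum sum={quantum:6.2f} (ln T={math.log(T):6.2f})"
        )
```

Under these assumptions the classical sum grows like $\sqrt{T}$ while the quantum sum grows only logarithmically in $T$, which is the kind of factor that the $\tilde{\mathcal{O}}(\cdot)$ notation absorbs.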