While quantum reinforcement learning (RL) has recently attracted a surge of attention, its theoretical understanding remains limited. In particular, it is unclear how to design provably efficient quantum RL algorithms that address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret bound for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which can handle problems with large state spaces. Specifically, we develop a quantum algorithm based on value target regression (VTR) for linear mixture MDPs with a $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of the classical UCRL and UCRL-VTR algorithms that additionally leverage a novel combination of lazy updating mechanisms and quantum estimation subroutines; this combination is the key to breaking the $\Omega(\sqrt{T})$ regret barrier of classical RL. To the best of our knowledge, this is the first work to study online exploration in quantum RL with provable logarithmic worst-case regret.
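As a rough intuition for why faster estimation can move regret from $\sqrt{T}$ to $\log T$, the sketch below sums a per-episode estimation error over $T$ episodes under two rates. It is a heuristic illustration only, not the algorithm of this work: it assumes that classical averaging of $n$ samples has error on the order of $1/\sqrt{n}$ and that a quantum mean-estimation subroutine has error on the order of $1/n$ (constants and logarithmic factors dropped), and it ignores the lazy updating mechanism entirely. The function names are hypothetical.

```python
import math

def cumulative_error(rate, num_episodes):
    """Sum a per-episode estimation error rate(n) over n = 1..num_episodes."""
    return sum(rate(n) for n in range(1, num_episodes + 1))

def classical_rate(n):
    # Classical Monte Carlo averaging of n samples: error ~ 1/sqrt(n).
    return 1.0 / math.sqrt(n)

def quantum_rate(n):
    # Assumption: quantum mean estimation with n quantum samples has
    # error ~ 1/n (constants and log factors ignored).
    return 1.0 / n

if __name__ == "__main__":
    for T in (10**2, 10**4, 10**6):
        classical = cumulative_error(classical_rate, T)
        quantum = cumulative_error(quantum_rate, T)
        print(
            f"T={T:>8}  classical sum={classical:10.1f} "
            f"(2*sqrt(T)={2 * math.sqrt(T):.1f})  "
            f"quantum sum={quantum:6.2f} (ln T + 1={math.log(T) + 1:.2f})"
        )
```

Under these assumptions the classical sum grows like $2\sqrt{T}$ while the quantum sum stays below $\ln T + 1$, which mirrors the gap between the $\Omega(\sqrt{T})$ barrier and the logarithmic regret stated above.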