While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding remains limited. In particular, it is still unclear how to design provably efficient quantum RL algorithms that address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret bound for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which can handle problems with large state spaces. Specifically, we develop a quantum algorithm based on value-target regression (VTR) for linear mixture MDPs with a $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of the UCRL/UCRL-VTR algorithms in classical RL, augmented with a novel combination of lazy updating mechanisms and quantum estimation subroutines, which is the key to breaking the $\Omega(\sqrt{T})$ regret barrier of classical RL. To the best of our knowledge, this is the first work to study online exploration in quantum RL with a provable logarithmic worst-case regret guarantee.
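To convey the intuition behind the logarithmic regret (a heuristic back-of-envelope sketch, not the paper's formal analysis): estimating a bounded mean from $n$ classical samples incurs error $\tilde{\mathcal{O}}(1/\sqrt{n})$, whereas quantum mean estimation via amplitude estimation attains error $\tilde{\mathcal{O}}(1/n)$ with $n$ quantum queries. If, as a simplifying assumption, the per-episode optimism bonus scales with the current estimation error, then accumulating these confidence widths over $T$ episodes gives
\[
\sum_{t=1}^{T} \frac{1}{\sqrt{t}} = \Theta(\sqrt{T}) \ \text{(classical)} \qquad \text{vs.} \qquad \sum_{t=1}^{T} \frac{1}{t} = \Theta(\log T) \ \text{(quantum)},
\]
with the lazy (doubling-epoch) updates ensuring that each quantum estimate is recomputed only on a sufficiently large batch of data.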