In this paper, we investigate \textit{episodic reinforcement learning} with quantum oracles for state evolution. We propose an \textit{Upper Confidence Bound} (UCB) based quantum algorithmic framework for learning a finite-horizon MDP. Our quantum algorithm achieves an exponential improvement in regret over its classical counterparts: $\Tilde{\mathcal{O}}(1)$ as opposed to $\Tilde{\mathcal{O}}(\sqrt{K})$\footnote{$\Tilde{\mathcal{O}}(\cdot)$ hides logarithmic terms.}, where $K$ is the number of training episodes. To achieve this advantage, we exploit an efficient quantum mean estimation technique that provides a quadratic improvement in the number of i.i.d. samples needed to estimate the mean of sub-Gaussian random variables, compared to classical mean estimation. This improvement is key to the significant regret improvement in quantum reinforcement learning. We provide proof-of-concept experiments on various RL environments that demonstrate the performance gains of the proposed algorithmic framework.
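
As a rough heuristic for why a quadratic gain in mean estimation can turn $\sqrt{K}$ regret into polylogarithmic regret, consider the following sketch. It is an illustration under simplifying assumptions, not the paper's proof: $\sigma$ (the sub-Gaussian parameter), $n$ (the number of samples), $\hat{\mu}_n$ (the estimate of the mean $\mu$), and the assumption that the sample count grows linearly with the episode index $k$ are all introduced here for illustration, and constants and logarithmic factors are dropped.
\[
\underbrace{|\hat{\mu}_n - \mu| \lesssim \frac{\sigma}{\sqrt{n}}}_{\text{classical}}
\qquad \text{versus} \qquad
\underbrace{|\hat{\mu}_n - \mu| \lesssim \frac{\sigma}{n}}_{\text{quantum}}
\]
If the per-episode regret of a UCB-style method scales with the estimation error after roughly $k$ samples, summing over $K$ episodes gives
\[
\sum_{k=1}^{K} \frac{1}{\sqrt{k}} \le 2\sqrt{K}
\qquad \text{versus} \qquad
\sum_{k=1}^{K} \frac{1}{k} \le 1 + \ln K ,
\]
which matches the $\Tilde{\mathcal{O}}(\sqrt{K})$ and $\Tilde{\mathcal{O}}(1)$ rates quoted above once logarithmic terms are hidden.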