Reinforcement learning is a powerful framework for determining optimal behavior in highly complex decision-making scenarios. This objective can be achieved using policy iteration, which requires solving a typically large linear system of equations. We propose the variational quantum policy iteration (VarQPI) algorithm, which realizes this step with a NISQ-compatible quantum-enhanced subroutine. Its scalability is supported by an analysis of the structure of generic reinforcement learning environments, laying the foundation for potential quantum advantage with utility-scale quantum computers. Furthermore, we introduce a warm-start initialization variant (WS-VarQPI) that significantly reduces resource overhead. The algorithm solves a large FrozenLake environment with an underlying 256 × 256 linear system, indicating its practical robustness.
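To make concrete where the linear system arises, the following is a minimal classical sketch of policy iteration, not the VarQPI algorithm itself. It assumes a tabular environment with transition tensor `P` of shape (states, actions, states) and reward matrix `R` of shape (states, actions); the helper `make_chain_mdp` and all function names are hypothetical and chosen for illustration only. The policy evaluation step solves (I − γ P_π) v = r_π with `np.linalg.solve`, which here stands in for the step that a quantum-enhanced subroutine would replace.

```python
import numpy as np


def make_chain_mdp(n_states=4):
    """Hypothetical toy environment: a deterministic chain with a goal at the right end."""
    P = np.zeros((n_states, 2, n_states))  # P[s, a, s'] = transition probability
    R = np.zeros((n_states, 2))            # R[s, a] = expected immediate reward
    goal = n_states - 1
    for s in range(goal):
        P[s, 0, max(s - 1, 0)] = 1.0       # action 0: move left (stay put at state 0)
        P[s, 1, s + 1] = 1.0               # action 1: move right
    P[goal, :, goal] = 1.0                 # the goal state is absorbing
    R[goal - 1, 1] = 1.0                   # reward for stepping into the goal
    return P, R


def evaluate_policy(P, R, policy, gamma):
    """Policy evaluation: solve (I - gamma * P_pi) v = r_pi for the state values."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]                  # (S, S) transition matrix under the policy
    r_pi = R[idx, policy]                  # (S,) reward vector under the policy
    # Classical direct solve; this is the linear system the abstract refers to.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(P.shape[0], dtype=int)
    for _ in range(max_iters):
        v = evaluate_policy(P, R, policy, gamma)
        q = R + gamma * (P @ v)            # (S, A) action values under v
        new_policy = np.argmax(q, axis=1)  # greedy improvement, ties go to action 0
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, v


if __name__ == "__main__":
    P, R = make_chain_mdp(4)
    policy, v = policy_iteration(P, R, gamma=0.9)
    print(policy, v)
```

In this sketch the system matrix has one row per state, so its dimension grows with the size of the state space; that growth is what makes the evaluation step the natural target for acceleration.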