Monte Carlo (MC) reinforcement learning suffers from high sample complexity, especially in environments with sparse rewards, large state spaces, and correlated trajectories. We address these limitations by reformulating episode selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem and solving it with quantum-inspired samplers. Our method, MC+QUBO, integrates a combinatorial filtering step into standard MC policy evaluation: from each batch of trajectories, we select a subset that maximizes cumulative reward while promoting state-space coverage. This selection is encoded as a QUBO, where linear terms favor high-reward episodes and quadratic terms penalize redundancy. We explore both Simulated Quantum Annealing (SQA) and Simulated Bifurcation (SB) as black-box solvers within this framework. Experiments in a finite-horizon GridWorld demonstrate that MC+QUBO outperforms vanilla MC in convergence speed and final policy quality, highlighting the potential of quantum-inspired optimization as a decision-making subroutine in reinforcement learning.
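The QUBO encoding described above can be sketched concretely. The following is a minimal illustration, not the paper's implementation: the episode data, the penalty weight `lam`, the use of state-set overlap as the redundancy measure, and the exhaustive solver (standing in for an SQA/SB black box) are all assumptions made for the sake of a small, runnable example.

```python
import itertools
import numpy as np

# Hypothetical toy batch: 4 episodes, each with its visited-state set and return.
episodes = [
    {"states": {0, 1, 2}, "reward": 5.0},
    {"states": {0, 1, 3}, "reward": 4.5},
    {"states": {6, 7, 8}, "reward": 3.0},
    {"states": {0, 1, 2}, "reward": 4.8},  # near-duplicate of episode 0
]

lam = 2.0  # assumed redundancy-penalty weight (hyperparameter)
n = len(episodes)

# Upper-triangular QUBO matrix: minimize x^T Q x over binary x.
Q = np.zeros((n, n))
for i in range(n):
    Q[i, i] = -episodes[i]["reward"]  # linear term: favor high-reward episodes
    for j in range(i + 1, n):
        overlap = len(episodes[i]["states"] & episodes[j]["states"])
        Q[i, j] = lam * overlap       # quadratic term: penalize redundant pairs

def qubo_energy(x):
    x = np.array(x)
    return float(x @ Q @ x)

# Exhaustive search stands in for the quantum-inspired solver at this toy size;
# SQA or SB would replace this line for realistic batch sizes.
best = min(itertools.product([0, 1], repeat=n), key=qubo_energy)
```

With these toy numbers, the minimizer selects episodes 0, 1, and 2 and drops episode 3: its reward does not compensate for its heavy state overlap with episode 0, which is exactly the coverage-promoting behavior the quadratic terms are meant to induce.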