Surpassing legacy approaches to PWR core reload optimization with single-objective reinforcement learning

Minimizing fuel cycle cost by optimizing nuclear reactor core loading patterns (LPs) involves multiple objectives and constraints, and the resulting space of candidate solutions is far too large to enumerate explicitly. To advance the state of the art in core reload optimization, we have developed methods based on Deep Reinforcement Learning (DRL) for both single- and multi-objective optimization. Our previous research laid the groundwork for these approaches and demonstrated their ability to discover high-quality patterns within a reasonable time frame. Stochastic optimization (SO) approaches, on the other hand, are commonly used in the literature, but there is no rigorous explanation of which approach performs better in which scenario. In this paper, we demonstrate the advantage of our RL-based approach, specifically Proximal Policy Optimization (PPO), over the most commonly used SO-based methods: Genetic Algorithm (GA), Parallel Simulated Annealing (PSA) with mixing of states, and Tabu Search (TS), as well as an ensemble-based method, Prioritized Replay Evolutionary and Swarm Algorithm (PESA). We found that the LP scenarios derived in this paper are amenable to a global search that rapidly identifies promising directions, but the search must then transition into a local one to exploit these directions efficiently without getting stuck in local optima. PPO adapts its search capability via a policy with learnable weights, allowing it to function as both a global and a local search method. We then compared all algorithms against PPO in long runs, which amplified the differences seen in the shorter cases. Overall, the work demonstrates the statistical superiority of PPO over the other algorithms considered.
