The nuclear fuel loading pattern optimization problem is a large-scale combinatorial optimization problem with multiple objectives and constraints, which makes it impossible to solve explicitly. Stochastic optimization methods, including Genetic Algorithms and Simulated Annealing, are used by various nuclear utilities and vendors, but hand-designed solutions remain the prevalent method in the industry. To improve the state of the art, we leverage deep Reinforcement Learning (RL), in particular Proximal Policy Optimization (PPO). This work presents a first-of-a-kind approach that uses deep RL to solve the loading pattern problem, and the approach could be leveraged for any engineering design optimization. To our knowledge, this paper is also the first to study the behavior of several hyper-parameters that influence the RL algorithm. The algorithm is highly dependent on multiple factors. One is the shape of the objective function derived for the core design, which behaves as a fudge factor that affects the stability of the learning. Another is an exploration/exploitation trade-off that manifests through different parameters, such as the number of loading patterns seen by the agents per episode, the number of samples collected before a policy update (nsteps), and an entropy factor (ent_coef) that increases the randomness of the policy during training. We found that RL must be applied similarly to a Gaussian Process in which the acquisition function is replaced by a parametrized policy. Then, once an initial set of hyper-parameters is found, reducing nsteps and ent_coef until no more learning is observed yields the highest sample efficiency in a robust and stable manner. This resulted in an economic benefit of 535,000 to 642,000 $/year/plant.
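The tuning recipe stated in the abstract (start from a working set of hyper-parameters, then reduce nsteps and ent_coef until no more learning is observed) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `reduce_hyperparameters`, the `learns` callback, the halving factor `shrink`, and the alternating order of the two reductions are all hypothetical choices made for this example. In practice, `learns` would wrap a full PPO training run and report whether the objective still improves.

```python
from typing import Callable, List, Tuple


def reduce_hyperparameters(
    learns: Callable[[int, float], bool],
    nsteps: int,
    ent_coef: float,
    shrink: float = 0.5,
    max_rounds: int = 20,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Shrink nsteps and ent_coef while the agent is still observed to learn.

    `learns(nsteps, ent_coef)` is a hypothetical user-supplied callback that
    trains an agent with the given values and returns True if learning is
    still observed. The starting values are assumed to be a setting for
    which learning already works.
    """
    history = [(nsteps, ent_coef)]
    for _ in range(max_rounds):
        progressed = False

        # Try a smaller number of samples per policy update.
        candidate_nsteps = max(1, int(nsteps * shrink))
        if candidate_nsteps < nsteps and learns(candidate_nsteps, ent_coef):
            nsteps = candidate_nsteps
            progressed = True

        # Try a smaller entropy coefficient (less random policy).
        candidate_ent = ent_coef * shrink
        if learns(nsteps, candidate_ent):
            ent_coef = candidate_ent
            progressed = True

        if not progressed:
            # Neither reduction preserves learning: stop at the last good setting.
            break
        history.append((nsteps, ent_coef))

    return nsteps, ent_coef, history


def toy_learns(nsteps: int, ent_coef: float) -> bool:
    """Toy stand-in for a training run; the thresholds are made up."""
    return nsteps >= 64 and ent_coef >= 0.001


if __name__ == "__main__":
    final_nsteps, final_ent, trace = reduce_hyperparameters(toy_learns, 512, 0.01)
    print(final_nsteps, final_ent, trace)
```

The loop keeps the last setting for which learning was still observed, so each accepted step uses fewer samples per update or less policy randomness than the one before it. Because every call to `learns` stands for a complete training run, the cost of the search grows with the number of rounds, which `max_rounds` bounds.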