While deep reinforcement learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, because effective, interpretable multi-parameter policies amenable to formal study are difficult to derive. We demonstrate how deep-RL can be leveraged to overcome this barrier, using as a representative case study the (1+($λ$,$λ$))-genetic algorithm optimizing OneMax, one of the few problems for which a super-constant speedup from dynamic parameter control has been formally proven. We first show that standard approaches struggle to converge in this multi-parameter setting, and we introduce algorithm-agnostic enhancements targeting action-space decomposition, reward shifting, and long-horizon discounting. With these in place, we compare common deep-RL methods and find that Double Deep Q-Networks uniquely avoid the policy collapse observed in Proximal Policy Optimization, yielding trajectories suitable for downstream analysis. Crucially, we move beyond the ``black-box'' nature of neural networks by distilling the learned behaviors into a transparent, symbolic control policy. The resulting policy not only offers interpretability for future theoretical analysis but also consistently outperforms existing baselines across a wide range of problem sizes.
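For readers unfamiliar with the case study, the following is a minimal Python sketch of the textbook (1+($λ$,$λ$))-genetic algorithm on OneMax with a static population size. It is not the controller studied here: the function names `onemax` and `one_plus_ll_ga`, the `max_evals` budget, and the `seed` argument are illustrative assumptions, and the sketch uses the standard single-parameter coupling (mutation rate $λ/n$, crossover bias $1/λ$). A multi-parameter controller would presumably choose such quantities separately at each iteration.

```python
import random


def onemax(x):
    """OneMax fitness: the number of one-bits in the bit string."""
    return sum(x)


def one_plus_ll_ga(n, lam, max_evals=200_000, seed=0):
    """Static (1+(lam,lam))-GA on OneMax (illustrative sketch only).

    Assumes the standard coupling p = lam / n and c = 1 / lam.
    Returns the final fitness and the number of fitness evaluations used.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = onemax(x)
    evals = 1
    p = lam / n      # mutation rate
    c = 1.0 / lam    # probability of taking a bit from the mutant

    while fx < n and evals < max_evals:
        # Mutation phase: draw ell ~ Bin(n, p) once, then create lam
        # offspring that each flip exactly ell distinct random bits.
        ell = sum(rng.random() < p for _ in range(n))
        best_mut, best_mut_f = None, -1
        for _ in range(lam):
            y = x[:]
            for i in rng.sample(range(n), ell):
                y[i] ^= 1
            fy = onemax(y)
            evals += 1
            if fy > best_mut_f:
                best_mut, best_mut_f = y, fy

        # Crossover phase: biased uniform crossover between the parent
        # and the best mutant, repeated lam times.
        best_cross, best_cross_f = None, -1
        for _ in range(lam):
            y = [m if rng.random() < c else xi
                 for xi, m in zip(x, best_mut)]
            fy = onemax(y)
            evals += 1
            if fy > best_cross_f:
                best_cross, best_cross_f = y, fy

        # Elitist selection: accept the best crossover offspring
        # if it is at least as good as the parent.
        if best_cross_f >= fx:
            x, fx = best_cross, best_cross_f

    return fx, evals
```

Calling `one_plus_ll_ga(50, 4)` runs the static variant on a 50-bit instance. A dynamic controller would replace the fixed `lam` (and, in the multi-parameter case, the derived `p` and `c`) with values selected from the current state, for example the current fitness.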