When does learning pay off? A study on DRL-based dynamic algorithm configuration for carbon-aware scheduling

Deep reinforcement learning (DRL) has recently emerged as a promising tool for Dynamic Algorithm Configuration (DAC), enabling evolutionary algorithms to adapt their parameters online rather than relying on statically tuned configurations. While DRL can learn effective control policies, training is computationally expensive. This cost may be justified if learned policies generalize, allowing the training effort to transfer across instance types and problem scales. Yet, for real-world optimization problems, it remains unclear whether this promise holds in practice and under which conditions the investment in learning pays off. In this work, we investigate this question in the context of the carbon-aware permutation flow-shop scheduling problem. We develop a DRL-based DAC framework and train it exclusively on small, simple instances. We then deploy the learned policy on both similar and more complex unseen instances and compare its performance against a statically tuned baseline, which provides a fair point of comparison. Our findings show that the proposed method provides a strong dynamic algorithm control policy that can be effectively transferred to different unseen problem instances. Notably, on simple, cheap-to-compute instances similar to those seen during training and tuning, DRL performs comparably to the statically tuned baseline. However, as instance characteristics diverge and computational complexity increases, the DRL-learned policy consistently outperforms static tuning. These results confirm that DRL can acquire robust and generalizable control policies that remain effective beyond the training instance distribution. This ability to generalize across instance types makes the initial computational investment worthwhile, particularly in settings where static tuning struggles to adapt to changing problem scenarios.
