Deep reinforcement learning agents often exhibit erratic, high-frequency control behaviors that hinder real-world deployment due to excessive energy consumption and mechanical wear. We systematically investigate action smoothness regularization through higher-order derivative penalties, progressing from theoretical understanding in continuous control benchmarks to practical validation in building energy management. Our comprehensive evaluation across four continuous control environments demonstrates that third-order derivative penalties (jerk minimization) consistently achieve superior smoothness while maintaining competitive performance. We extend these findings to HVAC control systems where smooth policies reduce equipment switching by 60%, translating to significant operational benefits. Our work establishes higher-order action regularization as an effective bridge between RL optimization and operational constraints in energy-critical applications.
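The third-order derivative penalty mentioned above can be illustrated with a discrete finite-difference term over a sampled action trajectory. This is a minimal sketch under assumptions, not the paper's implementation: the function name, coefficient, and exact discretization (a simple third difference as a jerk proxy) are illustrative choices.

```python
import numpy as np

def jerk_penalty(actions: np.ndarray, coef: float = 0.1) -> float:
    """Illustrative third-order smoothness penalty (hypothetical helper).

    actions: trajectory of shape (T, action_dim), sampled at fixed intervals.
    Returns coef times the mean squared third-order finite difference,
    a discrete proxy for jerk along the action sequence.
    """
    # np.diff with n=3 computes a_{t+3} - 3a_{t+2} + 3a_{t+1} - a_t
    d3 = np.diff(actions, n=3, axis=0)
    return float(coef * np.mean(np.sum(d3 ** 2, axis=-1)))
```

Any trajectory whose actions are at most quadratic in time has zero third difference and incurs no penalty, while high-frequency switching (e.g. bang-bang control) is penalized heavily; this is the sense in which the penalty discourages erratic control without forbidding slow, deliberate changes.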