The electric vehicle routing problem with time windows (EVRPTW) is a complex optimization problem in sustainable logistics, in which routing decisions must minimize total travel distance, fleet size, and battery usage while satisfying strict customer time-window constraints. Although deep reinforcement learning (DRL) has shown great potential as an alternative to classical heuristics and exact solvers, existing DRL models often struggle to maintain training stability: they fail to converge or to generalize when constraints are dense. In this study, we propose a curriculum-based deep reinforcement learning (CB-DRL) framework designed to resolve this instability. The framework uses a structured three-phase curriculum that gradually increases problem complexity: the agent first learns distance and fleet-size optimization (Phase A), then battery management (Phase B), and finally the full EVRPTW (Phase C). To ensure stable learning across phases, the framework employs a modified proximal policy optimization (PPO) algorithm with phase-specific hyperparameters, value and advantage clipping, and adaptive learning-rate scheduling. The policy network is built on a heterogeneous graph attention encoder enhanced by global-local attention and feature-wise linear modulation (FiLM). This specialized architecture explicitly captures the distinct properties of depots, customers, and charging stations. Trained exclusively on small instances with N=10 customers, the model generalizes robustly to unseen instances ranging from N=5 to N=100, significantly outperforming standard baselines on medium-scale problems. Experimental results confirm that this curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances where standard DRL baselines fail, effectively bridging the gap between neural inference speed and operational reliability.
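The three-phase curriculum can be summarized as a progressive constraint schedule. The sketch below is a minimal illustration of that idea; the class and function names are hypothetical and not part of the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CurriculumPhase:
    """One phase of the curriculum (names and fields are assumptions)."""
    name: str
    battery_constraints: bool  # Phase B adds battery management
    time_windows: bool         # Phase C adds the full EVRPTW

# Phase A: distance and fleet-size only; B adds battery; C adds time windows.
CURRICULUM = [
    CurriculumPhase("A", battery_constraints=False, time_windows=False),
    CurriculumPhase("B", battery_constraints=True, time_windows=False),
    CurriculumPhase("C", battery_constraints=True, time_windows=True),
]

def active_constraints(phase: CurriculumPhase) -> list:
    """Return the constraint set the agent must satisfy in this phase."""
    constraints = ["distance", "fleet_size"]  # always-on objectives
    if phase.battery_constraints:
        constraints.append("battery")
    if phase.time_windows:
        constraints.append("time_windows")
    return constraints
```

Training would iterate over `CURRICULUM` in order, carrying the learned policy weights from one phase into the next.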
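The modified PPO objective combines the standard clipped surrogate with the value and advantage clipping mentioned above. A minimal NumPy sketch of those two stabilizers follows; the epsilon values and function signature are assumptions, not the paper's hyperparameters.

```python
import numpy as np

def ppo_losses(ratio, adv, value, value_old, returns,
               clip_eps=0.2, value_clip=0.2, adv_clip=3.0):
    """Policy and value losses with advantage and value clipping.

    ratio: pi_new(a|s) / pi_old(a|s) per sample; adv: estimated advantages;
    value/value_old: current and pre-update value predictions; returns:
    empirical returns. All clipping thresholds here are illustrative.
    """
    # Advantage clipping: bound extreme advantages to stabilize updates.
    adv = np.clip(adv, -adv_clip, adv_clip)
    # Standard PPO clipped surrogate objective (negated to form a loss).
    surrogate = np.minimum(
        ratio * adv,
        np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
    )
    policy_loss = -surrogate.mean()
    # Value clipping: limit how far the value head may move per update.
    v_clipped = value_old + np.clip(value - value_old, -value_clip, value_clip)
    value_loss = np.maximum(
        (value - returns) ** 2,
        (v_clipped - returns) ** 2,
    ).mean()
    return policy_loss, value_loss
```

In the framework described above, the clipping thresholds would be among the phase-specific hyperparameters, tightened or relaxed as the curriculum advances.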