Pick-up and Delivery Route Prediction (PDRP), which aims to estimate a worker's future service route given their current task pool, has received rising attention in recent years. Deep neural networks based on supervised learning have emerged as the dominant model for the task because of their powerful ability to capture workers' behavior patterns from massive historical data. Though promising, they fail to introduce the non-differentiable test criteria into the training process, leading to a mismatch between training and test criteria that considerably degrades their performance when applied in practical systems. To tackle this issue, we present the first attempt to generalize Reinforcement Learning (RL) to the route prediction task, leading to a novel RL-based framework called DRL4Route. It combines the behavior-learning abilities of previous deep learning models with the non-differentiable objective optimization ability of reinforcement learning. DRL4Route can serve as a plug-and-play component to boost existing deep learning models. Based on the framework, we further implement a model named DRL4Route-GAE for PDRP in logistics services. It follows the actor-critic architecture and is equipped with a Generalized Advantage Estimator that balances the bias and variance of the policy gradient estimates, thus achieving a better policy. Extensive offline experiments and online deployment show that DRL4Route-GAE improves Location Square Deviation (LSD) by 0.9%-2.7% and Accuracy@3 (ACC@3) by 2.4%-3.2% over existing methods on a real-world dataset.
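To make the bias-variance trade-off of the Generalized Advantage Estimator concrete, the following is a minimal sketch of the standard GAE recursion in plain Python. It is not the DRL4Route-GAE implementation: the function name `generalized_advantage_estimates`, the default values of `gamma` and `lam`, and the toy rewards and critic values are assumptions made purely for illustration.

```python
from typing import List, Sequence


def generalized_advantage_estimates(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,  # discount factor (illustrative default)
    lam: float = 0.95,    # GAE lambda (illustrative default)
) -> List[float]:
    """Compute GAE advantages for one decoded route (one episode).

    `values` must hold one critic estimate per step plus a final
    bootstrap value, so len(values) == len(rewards) + 1.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical per-step rewards and critic values for a three-step route.
toy_rewards = [1.0, 0.0, 2.0]
toy_values = [0.5, 0.4, 0.3, 0.0]

# lam = 0 reduces to one-step TD errors (low variance, more bias).
td_advantages = generalized_advantage_estimates(toy_rewards, toy_values, gamma=1.0, lam=0.0)
# lam = 1 reduces to Monte Carlo returns minus the baseline (low bias, more variance).
mc_advantages = generalized_advantage_estimates(toy_rewards, toy_values, gamma=1.0, lam=1.0)
```

Intermediate values of `lam` interpolate between these two extremes, which is the sense in which the estimator balances bias and variance of the policy gradient.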