Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that uses these delayed signals to adapt dispatch objective weights in a large-scale food-delivery marketplace. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function on centralized offline data and execute it in a decentralized manner at the store level, using Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. The results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
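The training objective named in the abstract, Double Q-learning targets combined with a conservative regularizer, can be sketched as follows. This is a minimal illustration and not the deployed implementation: the abstract does not specify the form of the regularizer, so the sketch assumes a CQL-style log-sum-exp penalty over the discrete multiplier actions, and the function names, `gamma`, and `alpha` are hypothetical.

```python
import numpy as np


def double_q_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double Q-learning target.

    The online network selects the next action and the target network
    evaluates it, which reduces the overestimation of a plain max.
    Shapes: rewards, dones -> (batch,); q_*_next -> (batch, n_actions).
    """
    batch = np.arange(len(rewards))
    best_next = np.argmax(q_online_next, axis=1)   # action choice: online net
    next_value = q_target_next[batch, best_next]   # action value: target net
    return rewards + gamma * (1.0 - dones) * next_value


def conservative_penalty(q_values, actions):
    """Assumed CQL-style regularizer (not confirmed by the source).

    Penalizes the gap between a soft maximum over all actions and the Q-value
    of the logged action, pushing down values of actions unseen in the data.
    """
    batch = np.arange(len(actions))
    row_max = q_values.max(axis=1, keepdims=True)  # for numerical stability
    logsumexp = row_max[:, 0] + np.log(np.exp(q_values - row_max).sum(axis=1))
    return float(np.mean(logsumexp - q_values[batch, actions]))


def offline_q_loss(q_values, actions, targets, alpha=1.0):
    """Squared TD error on logged actions plus the weighted conservative term."""
    batch = np.arange(len(actions))
    td_error = q_values[batch, actions] - targets
    return float(np.mean(td_error ** 2) + alpha * conservative_penalty(q_values, actions))
```

In this sketch each action index would correspond to one discrete multiplier, and a store-level policy would act greedily on the shared Q-function, for example `np.argmax(q_values, axis=1)`. The weight `alpha` trades off fit to the logged data against pessimism about multipliers that rarely appear in it.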