The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
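For readers unfamiliar with the finetuning step, the standard DPO objective for a single preference pair can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes the forecasting model and a frozen copy of the pretrained (reference) model can each assign a log-probability to a whole trajectory, and the function name, argument names, and the value of `beta` are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    Inputs are trajectory log-probabilities under the model being finetuned
    (policy) and under the frozen pretrained model (ref). beta=0.1 is an
    illustrative value, not one taken from the paper.
    """
    # How much more the policy favors each trajectory than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) written as log(1 + exp(-z)), branched for stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Before any update the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen trajectory's likelihood relative to the rejected one lowers it.
improved = dpo_loss(-0.5, -2.5, -1.0, -2.0)
```

In this sketch the "chosen" and "rejected" trajectories would correspond to the VLM-selected preference pair drawn from the pretrained model's rollouts. The loss falls as the finetuned model shifts probability toward the preferred trajectory, measured relative to the pretrained model.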