Motivated by the emergent reasoning capabilities of Vision Language Models (VLMs) and their potential to improve the comprehensibility of autonomous driving systems, this paper introduces VLM-MPC, a closed-loop autonomous driving controller that combines a Model Predictive Controller (MPC) with a VLM to evaluate how model-based control can enhance VLM decision-making. The proposed VLM-MPC is structured into two asynchronous components: the upper-level VLM generates driving parameters (e.g., desired speed, desired headway) for lower-level control based on front camera images, the ego vehicle state, traffic environment conditions, and reference memory; the lower-level MPC controls the vehicle in real time using these parameters, accounting for engine lag and providing state feedback to the entire system. Experiments on the nuScenes dataset validate the effectiveness of the proposed VLM-MPC across various environments (e.g., night, rain, and intersections). The results demonstrate that VLM-MPC consistently maintains Post Encroachment Time (PET) above safe thresholds, whereas VLM-based control poses collision risks in some scenarios. Additionally, VLM-MPC produces smoother driving than both the real-world trajectories and VLM-based control. By comparing behaviors under different environmental settings, we highlight VLM-MPC's capability to understand the environment and make reasoned inferences. Moreover, ablation tests validate the contributions of two key components, the reference memory and the environment encoder, to the stability of responses.