End-to-end autonomous driving has emerged as a promising paradigm that integrates perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have attracted significant attention for their potential to improve the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulty handling corner cases. To address these issues, we propose AppleVLM, a perception- and planning-enhanced VLM for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. First, the vision encoder fuses spatio-temporal information from multi-view images across multiple timesteps using a deformable transformer mechanism, which improves robustness to camera variations and enables scalable deployment across different vehicle platforms. Second, unlike prior VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View (BEV) spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned with a hierarchical Chain-of-Thought strategy integrates the vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, where it achieves state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an automated guided vehicle (AGV) platform and demonstrate real-world end-to-end autonomous driving in complex outdoor environments.
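To make the vision-encoder idea concrete, below is a minimal PyTorch sketch of deformable spatio-temporal fusion in the spirit of Deformable DETR-style attention: each query predicts a small set of sampling offsets around a reference point and aggregates bilinearly sampled features from every (view, timestep) feature map. All names here (`DeformableSpatioTemporalFusion`, `feat_maps`, `ref_points`) are illustrative assumptions, not AppleVLM's actual interface.

```python
# Hedged sketch: deformable spatio-temporal multi-view fusion.
# Assumes Deformable-DETR-style sampling; not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSpatioTemporalFusion(nn.Module):
    """Fuse multi-view, multi-timestep image features into query embeddings."""

    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        # Per-query sampling offsets and attention weights, shared across maps.
        self.offset_head = nn.Linear(dim, n_points * 2)
        self.weight_head = nn.Linear(dim, n_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, feat_maps, ref_points):
        # queries:    (B, Q, C) query embeddings (e.g. BEV grid queries)
        # feat_maps:  list over views x timesteps of (B, C, H, W) feature maps
        # ref_points: (B, Q, 2) reference locations in [-1, 1] image coords
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.n_points, 2)
        weights = self.weight_head(queries).softmax(-1)  # (B, Q, P)
        out = 0.0
        for fm in feat_maps:
            # Sampling grid: reference point plus small learned offsets.
            grid = ref_points.unsqueeze(2) + 0.05 * offsets.tanh()  # (B,Q,P,2)
            sampled = F.grid_sample(fm, grid, align_corners=False)  # (B,C,Q,P)
            sampled = sampled.permute(0, 2, 3, 1)                   # (B,Q,P,C)
            out = out + (weights.unsqueeze(-1) * sampled).sum(dim=2)
        return self.proj(out / len(feat_maps))

# Toy usage: 6 camera views x 2 timesteps feeding 100 queries.
fusion = DeformableSpatioTemporalFusion(dim=256)
maps = [torch.randn(1, 256, 28, 50) for _ in range(6 * 2)]
q = torch.randn(1, 100, 256)
ref = torch.rand(1, 100, 2) * 2 - 1
print(fusion(q, maps, ref).shape)  # torch.Size([1, 100, 256])
```

Sampling at learned offsets rather than attending densely over every pixel is what makes this kind of fusion tolerant to camera placement changes: the queries learn where to look rather than memorizing a fixed layout.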
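The planning modality can likewise be pictured as a small encoder that turns explicit BEV route geometry into tokens, so the decoder conditions on spatial coordinates rather than only on textual navigation commands. The sketch below is a hedged illustration; the names (`PlanningEncoder`, `route_xy`) are hypothetical and not taken from the paper.

```python
# Hedged sketch: encode a BEV route polyline as planning tokens.
# An assumed minimal design, not AppleVLM's actual planning encoder.
import torch
import torch.nn as nn

class PlanningEncoder(nn.Module):
    """Map a BEV route polyline (x, y per point) to planning tokens."""

    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, route_xy):
        # route_xy: (B, N, 2) ego-frame BEV coordinates of the target route.
        return self.mlp(route_xy)  # (B, N, dim) planning tokens

# These tokens would be concatenated with the vision and language tokens
# before the VLM decoder, which then regresses future driving waypoints.
enc = PlanningEncoder()
tokens = enc(torch.randn(2, 10, 2))
print(tokens.shape)  # torch.Size([2, 10, 256])
```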