On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao

The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuances of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in developing the common-sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLMs) represents a novel frontier in realizing fully autonomous driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's ability to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests range from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It shows the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. The project is available on GitHub: https://github.com/PJLab-ADG/GPT4V-AD-Exploration
