On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi

The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuances of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in developing the common-sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLMs) represents a novel frontier in realizing fully autonomous driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests range from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It shows the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. The project is available on GitHub for interested parties to access and use: https://github.com/PJLab-ADG/GPT4V-AD-Exploration
