With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems built on large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite this immense potential, a comprehensive understanding of the key challenges, opportunities, and future directions for applying LLMs to driving systems is still lacking. In this paper, we present a systematic investigation of this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models using LLMs, and the history of autonomous driving. Then, we give an overview of existing MLLM tools for driving, transportation, and map systems, together with existing datasets and benchmarks. Moreover, we summarize the works presented at the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), which is the first workshop of its kind on LLMs in autonomous driving. To further promote the development of this field, we also discuss several important open problems in using MLLMs in autonomous driving systems that need to be solved by both academia and industry.