Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.