Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advances, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting overall capability and the ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.