Autonomous driving requires a comprehensive understanding of 3D environments to facilitate high-level tasks such as motion prediction, planning, and mapping. In this paper, we introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving. DriveMLLM includes 2,734 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions. To measure MLLMs' performance, we propose novel evaluation metrics focusing on spatial understanding. We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts. We believe these findings underscore the need for more advanced MLLM-based spatial reasoning methods and highlight the potential for DriveMLLM to drive further research in autonomous driving. Code will be available at \url{https://github.com/XiandaGuo/Drive-MLLM}.