HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

Autonomous driving systems generally employ separate models for different tasks, resulting in intricate designs. For the first time, we leverage a single multimodal large language model (MLLM) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. ROLISP uses natural language to simultaneously identify and interpret risk objects, understand ego-vehicle intentions, and provide motion suggestions, eliminating the need for task-specific architectures. However, lacking high-resolution (HR) information, existing MLLMs often miss small objects (e.g., traffic cones) and focus excessively on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task. Specifically, HiLM-D integrates two branches: (i) the low-resolution reasoning branch, which can be any MLLM, processes low-resolution videos to caption risk objects and discern ego-vehicle intentions and suggestions; (ii) the high-resolution perception branch (HR-PB), the central component of HiLM-D, ingests HR images to enhance detection by capturing vision-specific HR feature maps and prioritizing all potential risks rather than only salient objects. Our HR-PB serves as a plug-and-play module that fits seamlessly into current MLLMs. Experiments on the ROLISP benchmark show HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
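To make the detection metric concrete, the following is a minimal sketch of how a mean IoU (mIoU) score could be computed for risk-object localization. It assumes one predicted box and one ground-truth box per sample, with boxes given as `(x1, y1, x2, y2)` corner coordinates; the abstract does not specify the exact evaluation protocol, so the box format and the helper names `box_iou` and `mean_iou` are illustrative assumptions rather than the authors' implementation.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    # Hypothetical example: a large object localized exactly and a small
    # object missed entirely (the prediction lands elsewhere).
    preds = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
    targets = [(0.0, 0.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0)]
    print(mean_iou(preds, targets))
```

Under this definition, a prediction that misses a small object entirely contributes an IoU of zero, so missed small objects lower the average regardless of how well large objects are localized.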
