The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically covers only a limited set of tasks and often omits multi-view and temporal information, both of which are crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), making it significantly more challenging. To construct NuInstruct, we propose a novel SQL-based method that automatically generates instruction-response pairs and is inspired by the logical progression of human driving. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features that are language-aligned for large language models. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is plug-and-play for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., by around 9% on various tasks. We plan to release NuInstruct for future research.
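To make the idea of SQL-based instruction-response generation concrete, the following is a minimal sketch and not the pipeline used to build NuInstruct. It assumes a hypothetical `objects` table (columns `frame`, `object_id`, `category`, `x`, `y`, with positions in metres relative to the ego vehicle), a hypothetical question template, and Python's standard `sqlite3` module; none of these names come from the abstract.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "frame INTEGER, object_id TEXT, category TEXT, x REAL, y REAL)"
    )
    # Toy annotations: (x, y) are assumed ego-relative positions in metres.
    rows = [
        (0, "car_1", "car", 10.0, 2.0),
        (0, "ped_1", "pedestrian", 4.0, -3.0),
        (1, "car_1", "car", 8.0, 2.0),
        (1, "ped_1", "pedestrian", 4.0, -1.0),
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def closest_object_qa(conn, frame):
    """Answer a templated question with one SQL query over the annotations."""
    row = conn.execute(
        "SELECT category, x, y FROM objects "
        "WHERE frame = ? ORDER BY x * x + y * y ASC LIMIT 1",
        (frame,),
    ).fetchone()
    if row is None:
        return None  # no annotated objects in this frame
    category, x, y = row
    distance = (x ** 2 + y ** 2) ** 0.5
    return {
        "instruction": f"Which object is closest to the ego vehicle in frame {frame}?",
        "response": f"A {category}, about {distance:.1f} m away.",
    }


if __name__ == "__main__":
    conn = build_demo_db()
    for frame in (0, 1):
        print(closest_object_qa(conn, frame))
```

In this sketch each question type corresponds to one query plus one text template, so adding a subtask would mean adding another query-template pair over the same table.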