Large Language Models (LLMs) have shown promise in autonomous driving, particularly for generalization and interpretability. We introduce a novel object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high-quality control commands collected by an RL agent and question-answer pairs generated by a teacher LLM (GPT-3.5). We devise a distinct pretraining strategy that aligns numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and making decisions. Our findings highlight the potential of LLM-based driving action generation in comparison to traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.
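To make the idea of merging vectorized numeric modalities with a pre-trained LLM more concrete, the following is a minimal sketch, not the architecture described in this work. It assumes that each object is a short numeric vector (the four-feature layout here is hypothetical), that a linear projection maps it into the LLM's embedding space, and that the projected vectors are prepended to the text token embeddings. The names `make_projection`, `project`, and `build_input_sequence` are illustrative only, and the random weights stand in for parameters that would be learned during pretraining.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create an out_dim x in_dim weight matrix (random stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, vector):
    """Apply the linear map: one output value per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_input_sequence(object_vectors, text_embeddings, weights):
    """Prepend projected object vectors to the text token embeddings."""
    prefix = [project(weights, v) for v in object_vectors]
    return prefix + text_embeddings


# Hypothetical object features: (x, y, speed, heading). Assumed layout, not from the paper.
OBJECT_DIM = 4
EMBED_DIM = 8  # toy size; a real LLM embedding is far larger

objects = [
    [12.0, -1.5, 8.3, 0.0],  # e.g. a vehicle ahead
    [3.0, 4.0, 1.2, 1.57],   # e.g. a pedestrian to the side
]
# Stand-in for embeddings of three text tokens from a frozen LLM.
text_tokens = [[0.0] * EMBED_DIM for _ in range(3)]

weights = make_projection(OBJECT_DIM, EMBED_DIM)
sequence = build_input_sequence(objects, text_tokens, weights)
print(len(sequence), len(sequence[0]))
```

In this sketch the LLM's own embeddings stay untouched and only the projection would be trained, which is one plausible reading of aligning numeric vectors with static LLM representations.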