Large Language Models (LLMs) have shown promise in autonomous driving, particularly for generalization and interpretability. We introduce an object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios; the scenarios are paired with high-quality control commands collected with an RL agent, and the question-answer pairs are generated by a teacher LLM (GPT-3.5). We devise a pretraining strategy that aligns numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and making decisions. Our findings highlight the potential of LLM-based driving action generation compared with traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.