We introduce iMotion-LLM, a multimodal large language model (LLM) with trajectory prediction, tailored to guide interactive multi-agent scenarios. Unlike conventional motion prediction approaches, iMotion-LLM takes textual instructions as key inputs for generating contextually relevant trajectories. By enriching the real-world driving scenarios in the Waymo Open Dataset with textual motion instructions, we created InstructWaymo. Leveraging this dataset, iMotion-LLM integrates a pretrained LLM, fine-tuned with LoRA, to translate scene features into the LLM input space. iMotion-LLM offers two significant advantages over conventional motion prediction models. First, when the instructed direction is feasible, it can generate trajectories that align with the provided instruction. Second, when given an infeasible direction, it can reject the instruction, thereby enhancing safety. These findings are milestones toward enabling autonomous navigation systems to interpret and predict the dynamics of multi-agent environments, laying the groundwork for future advancements in this field.