Trajectory planning is a core task in autonomous driving, requiring the prediction of safe and comfortable paths across diverse scenarios. Integrating Multi-modal Large Language Models (MLLMs) with Reinforcement Learning (RL) has shown promise in addressing "long-tail" scenarios. However, existing methods are constrained to single-turn reasoning, limiting their ability to handle complex tasks that require iterative refinement. To overcome this limitation, we present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback. MTDrive introduces Multi-Turn Group Relative Policy Optimization (mtGRPO), which mitigates reward sparsity by computing relative advantages across turns. We further construct an interactive trajectory understanding dataset from closed-loop simulation to support multi-turn training. Experiments on the NAVSIM benchmark demonstrate superior performance compared to existing methods, validating the effectiveness of our multi-turn reasoning paradigm. Additionally, we implement system-level optimizations to reduce the data transfer overhead caused by high-resolution images and multi-turn sequences, achieving a 2.5x improvement in training throughput. Our data, models, and code will be made available soon.
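The cross-turn advantage idea can be illustrated with a minimal sketch. In standard GRPO, advantages are normalized within a group of single-turn rollouts; the abstract states that mtGRPO instead computes relative advantages across turns. The sketch below pools per-turn rewards from all rollouts in a group and normalizes jointly, so that even turns with sparse reward receive a non-degenerate relative signal. The function name, the pooling choice, and the zero-variance guard are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def mtgrpo_advantages(turn_rewards):
    """Sketch of cross-turn group-relative advantage estimation.

    turn_rewards: list of per-rollout reward sequences, where
    turn_rewards[i][t] is the scalar reward of rollout i at turn t.
    Returns advantages of the same shape, normalized jointly over
    all rollouts AND all turns in the group (assumed pooling).
    """
    # Pool rewards across every rollout and every turn in the group.
    flat = [r for rollout in turn_rewards for r in rollout]
    mu = statistics.mean(flat)
    # Guard against a zero-variance group (all rewards identical).
    sigma = statistics.pstdev(flat) or 1.0
    # Each turn's advantage is its reward relative to the pooled baseline.
    return [[(r - mu) / sigma for r in rollout] for rollout in turn_rewards]
```

Normalizing over the pooled set, rather than per turn, is what lets informative turns propagate a relative signal to turns whose own rewards are sparse.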