OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

The deployment of Vision-Language Models (VLMs) in safety-critical domains such as autonomous driving (AD) is hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. Existing multi-modal CoT approaches attempt to mitigate it, but they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. We therefore introduce OmniDrive-R1, an end-to-end VLM framework for autonomous driving that unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability that lets the model autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and the Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 show that, compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35% (+28.58 points) and the final answer accuracy from 37.81% to 73.62% (+35.81 points).
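To make the idea of an annotation-free grounding reward inside a group-relative policy update more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the "zoomed-in" region and the accompanying reasoning text have already been embedded into a shared space by some cross-modal encoder, that their consistency is scored by cosine similarity, and that the final reward is a weighted sum of answer correctness and this score. The function names, the `grounding_weight` value, and the toy embeddings are all hypothetical; only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def grounding_reward(region_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Hypothetical label-free reward: image-text consistency, floored at zero.

    No bounding-box annotation is used; the score depends only on how well
    the selected region agrees with the reasoning text.
    """
    return max(0.0, cosine_similarity(region_emb, text_emb))


def total_reward(answer_correct: bool,
                 region_emb: Sequence[float],
                 text_emb: Sequence[float],
                 grounding_weight: float = 0.5) -> float:
    """Assumed combination: outcome reward plus weighted process (grounding) reward."""
    return float(answer_correct) + grounding_weight * grounding_reward(region_emb, text_emb)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward by the group's mean and std."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of four rollouts for the same question (made-up 2-D embeddings).
text_emb = [1.0, 0.0]
rollouts = [
    (True, [1.0, 0.0]),   # correct answer, region consistent with the text
    (True, [0.0, 1.0]),   # correct answer, region unrelated to the text
    (False, [1.0, 0.0]),  # wrong answer, region consistent with the text
    (False, [0.0, 1.0]),  # wrong answer, region unrelated to the text
]
rewards = [total_reward(ok, region, text_emb) for ok, region in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, a rollout that answers correctly while looking at a region consistent with its reasoning receives the largest advantage, and one that does neither receives the smallest, which is the qualitative behavior a process-based grounding reward is meant to encourage.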
