Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA: Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed Schmerling, Shida Shen, Yunfei Shi, Sarah Tariq, Ran Tian, Tilman Wekel, Xinshuo Weng, Tianjun Xiao, Eric Yang, Xiaodong Yang, Yurong You, Xiaohui Zeng, Wenyuan Zhang, Boris Ivanovic, Marco Pavone

End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.
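As a practical note on the released weights, the following is a minimal sketch of one way to fetch them with the `huggingface_hub` client. It is not taken from the authors' inference code. The helper names `repo_id_from_url` and `download_weights` and the local directory are hypothetical, and it is an assumption that the repository can be downloaded this way; access may require accepting a license and authenticating first.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract the "<org>/<name>" repo id from a Hugging Face model URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a model URL: {url!r}")
    return "/".join(parts[:2])


def download_weights(url: str, local_dir: str) -> str:
    """Download a snapshot of the model repo and return its local path."""
    # Imported lazily so the URL helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_from_url(url), local_dir=local_dir)


# URL as given in the abstract.
WEIGHTS_URL = "https://huggingface.co/nvidia/Alpamayo-R1-10B"

if __name__ == "__main__":
    print(repo_id_from_url(WEIGHTS_URL))
    # Uncomment to fetch the weights (hypothetical target directory; this is a
    # large download and may require prior authentication):
    # download_weights(WEIGHTS_URL, "./alpamayo-r1-10b")
```

The download call is left commented out so the script can be run safely to check the repository id before committing to a multi-gigabyte transfer.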
