LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction

End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face three critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations, which introduces linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning, which hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state of the art on the NAVSIM benchmark with a PDMS of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.

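To make challenge (1) concrete, the sketch below shows how discretizing a continuous waypoint coordinate into a fixed vocabulary of bins bounds the achievable precision. It is only an illustration: the helper `make_tokenizer`, the coordinate range of [-50, 50] m, and the 256-bin vocabulary are assumptions chosen for the example, not the tokenizer used by LatentVLA or by any particular VLA model.

```python
def make_tokenizer(lo, hi, num_bins):
    """Uniform binning over [lo, hi] (hypothetical, for illustration only)."""
    width = (hi - lo) / num_bins

    def encode(x):
        # Clamp to the supported range, then map to a bin index in [0, num_bins - 1].
        x = min(max(x, lo), hi)
        return min(int((x - lo) / width), num_bins - 1)

    def decode(token):
        # Reconstruct the coordinate as the center of its bin.
        return lo + (token + 0.5) * width

    return encode, decode, width


# Assumed setup: coordinates in metres, 256 discrete tokens per axis.
encode, decode, width = make_tokenizer(lo=-50.0, hi=50.0, num_bins=256)

waypoints = [0.0, 1.237, 12.5, -33.333, 49.99]
errors = [abs(decode(encode(x)) - x) for x in waypoints]
max_error = max(errors)

print(f"bin width = {width:.4f} m, worst round-trip error = {max_error:.4f} m")
```

Under uniform binning the round-trip error can reach half a bin width, so precision improves only by enlarging the vocabulary or narrowing the range. Predicting continuous outputs avoids this trade-off, which is one motivation for moving away from discrete trajectory tokens.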