Autonomous cars need both geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle the two separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m $\times$ 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action prediction and textual explanation, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any one modality (vision, waypoint, or map) drops success by up to 11%, confirming that the modalities play complementary roles. Replacing goal-centered attention with simple concatenation costs 3% in performance, showing that query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs to specific tasks such as autonomous driving. Coarsening the map resolution from 10 cm to 40 cm blurs lane edges and raises the crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
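
For readers unfamiliar with the SPL metric reported above, the following is a minimal sketch of how Success weighted by Path Length is commonly computed: each successful episode contributes the ratio of the shortest-path length to the longer of the shortest and actually driven path lengths, failed episodes contribute zero, and the result is averaged over all episodes. The function name `spl` and its argument names are illustrative assumptions and are not taken from the XYZ-Drive evaluation code, which may differ in detail.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        driven_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Hypothetical helper for illustration only. For episode i with
    success flag S_i, shortest-path length l_i and driven length p_i,
    the contribution is S_i * l_i / max(p_i, l_i).
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(driven_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for ok, shortest, driven in zip(successes, shortest_lengths, driven_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if ok:
            # A detour (driven > shortest) scales the credit below 1.
            total += shortest / max(driven, shortest)
    return total / n


if __name__ == "__main__":
    # Four made-up episodes: optimal, 2x detour, failure, mild detour.
    score = spl([True, True, False, True],
                [10.0, 10.0, 10.0, 10.0],
                [10.0, 20.0, 10.0, 12.5])
    print(f"SPL = {score:.3f}")
```

Under this definition SPL can never exceed the plain success rate, which is consistent with the reported 0.80 SPL alongside 95% success: some successful runs take paths longer than the shortest one.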