Vision-Language Cross-Attention for Real-Time Autonomous Driving

Autonomous cars need both geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle the two separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m $\times$ 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, while improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles. Replacing goal-centered attention with simple concatenation costs 3% in performance, showing that query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs to specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises the crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
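
To make the goal-centered fusion idea concrete, here is a minimal sketch of single-head cross-attention in which one waypoint (goal) token acts as the query and the image and map patches act as keys and values. It is not the XYZ-Drive implementation: the function names, the embedding size `d = 8`, the patch counts, and the random projection matrices are all assumptions chosen for illustration, and the abstract does not specify the number of heads, the token dimensions, or how the fused output is passed to the language model.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def goal_cross_attention(goal, patches, w_q, w_k, w_v):
    """Single-head cross-attention with the goal token as the only query.

    goal:    waypoint embedding, a list of d floats (hypothetical encoding)
    patches: image and map patch embeddings, each a list of d floats
    Returns (fused, weights): the attention-weighted summary of the patches
    and the weight assigned to each patch.
    """
    query = matvec(w_q, goal)
    keys = [matvec(w_k, p) for p in patches]
    values = [matvec(w_v, p) for p in patches]
    scale = math.sqrt(len(query))
    # Scaled dot product between the goal query and every patch key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    fused = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


def random_matrix(rng, rows, cols):
    """Random projection matrix; a stand-in for learned weights."""
    return [
        [rng.gauss(0.0, 1.0) / math.sqrt(cols) for _ in range(cols)]
        for _ in range(rows)
    ]


rng = random.Random(0)
d = 8  # illustrative embedding size, not taken from the paper
w_q, w_k, w_v = (random_matrix(rng, d, d) for _ in range(3))

goal = [rng.gauss(0.0, 1.0) for _ in range(d)]
image_patches = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(6)]
map_patches = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
patches = image_patches + map_patches

fused, weights = goal_cross_attention(goal, patches, w_q, w_k, w_v)
print("attention over image patches:", [round(w, 3) for w in weights[:6]])
print("attention over map patches:  ", [round(w, 3) for w in weights[6:]])
```

Simple concatenation, the baseline in the ablation, would instead hand all tokens to the transformer with no goal-conditioned weighting. In this sketch the weights sum to one across image and map patches, so they can also be read as a rough indication of which patches the waypoint query attends to.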

