Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset, InstructScene, containing around 100,000 scenes annotated with diverse driving instructions and their corresponding trajectories. We then propose Vega, a unified Vision-Language-World-Action model for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interaction between modalities and use separate projection layers for each modality to extend the model's capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
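To make the joint-attention design concrete, the following is a minimal sketch of one way tokens from several modalities could share a single attention operation while each modality keeps its own query, key, and value projections. It is an illustration under stated assumptions, not Vega's implementation: the single attention head, the absence of any attention mask, the random weights, and all names (`MODALITIES`, `init_projections`, `joint_attention`) are hypothetical.

```python
import math
import random

# Hypothetical modality tags; the abstract names these four roles.
MODALITIES = ("vision", "language", "world", "action")


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_projections(dim, rng):
    """Create separate random Q/K/V matrices for every modality (assumed layout)."""
    def rand_matrix():
        return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]
    return {m: {"q": rand_matrix(), "k": rand_matrix(), "v": rand_matrix()}
            for m in MODALITIES}


def joint_attention(tokens, proj):
    """Single-head attention over one mixed sequence.

    tokens: list of (modality, vector) pairs.
    Each token is projected with its own modality's weights, then every
    token attends to every other token regardless of modality.
    No mask is applied here; that is a simplifying assumption.
    """
    q = [matvec(proj[m]["q"], x) for m, x in tokens]
    k = [matvec(proj[m]["k"], x) for m, x in tokens]
    v = [matvec(proj[m]["v"], x) for m, x in tokens]
    dim = len(q[0])
    outputs = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        outputs.append([sum(w * vj[d] for w, vj in zip(weights, v))
                        for d in range(dim)])
    return outputs


# Toy usage: five tokens drawn from four modalities, embedding size 4.
rng = random.Random(0)
dim = 4
proj = init_projections(dim, rng)
tokens = [(m, [rng.gauss(0.0, 1.0) for _ in range(dim)])
          for m in ("vision", "vision", "language", "world", "action")]
outputs = joint_attention(tokens, proj)
```

The point of the sketch is the split of responsibilities: the projections are looked up per modality, while the score computation runs over the concatenated sequence, so information can flow between modalities within one attention step.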