Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that pairs a continuous regression head with a kinematic consistency constraint to improve trajectory feasibility. To train StyleVLA, which is built on Qwen3-VL-4B, we construct a large-scale instruction dataset comprising over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, along with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. On a composite driving score that measures success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.
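As a rough sketch of the hybrid loss mentioned above (the exact terms and weighting are our assumption, not stated in the abstract), one natural form is

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{reg}}\big(\hat{\tau}, \tau^{*}\big) \;+\; \lambda\, \mathcal{L}_{\text{kin}}\big(\hat{\tau}\big),
\]

where $\mathcal{L}_{\text{reg}}$ is a continuous regression loss (e.g., L1 or L2) between the trajectory $\hat{\tau}$ predicted by the regression head and the ground-truth trajectory $\tau^{*}$, $\mathcal{L}_{\text{kin}}$ penalizes violations of kinematic feasibility (e.g., curvature or acceleration bounds), and $\lambda$ balances the two terms.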