We propose DynVLA, a Vision-Language-Action (VLA) model for driving that introduces a new Chain-of-Thought (CoT) paradigm, termed Dynamics CoT. DynVLA forecasts compact world dynamics before generating actions, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Because interaction-intensive driving scenarios involve rich environment dynamics, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), improving decision quality while keeping inference latency low. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy through dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
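The ordering that Dynamics CoT imposes on the decoded sequence (dynamics tokens first, action tokens second) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the delimiter strings, the `Sample` container, the `ego_`/`env_`/`a_` prefixes, and the choice to place ego-centric tokens before environment-centric ones are hypothetical and are not taken from the DynVLA implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical delimiter tokens; the source does not specify a vocabulary.
BEGIN_DYN, END_DYN = "<dyn>", "</dyn>"
BEGIN_ACT, END_ACT = "<act>", "</act>"


@dataclass
class Sample:
    """One training example (illustrative layout, not the authors' format)."""
    ego_dynamics: List[int]  # ids assumed to come from a dynamics tokenizer (ego-centric)
    env_dynamics: List[int]  # ids assumed to come from a dynamics tokenizer (environment-centric)
    actions: List[int]       # discretized action ids (assumed representation)


def build_target(sample: Sample) -> List[str]:
    """Lay out the decoding target so that dynamics tokens precede action tokens.

    Ego-centric tokens are placed before environment-centric ones; that order
    is an assumption of this sketch.
    """
    dynamics = [f"ego_{i}" for i in sample.ego_dynamics]
    dynamics += [f"env_{i}" for i in sample.env_dynamics]
    actions = [f"a_{i}" for i in sample.actions]
    return [BEGIN_DYN, *dynamics, END_DYN, BEGIN_ACT, *actions, END_ACT]


if __name__ == "__main__":
    example = Sample(ego_dynamics=[3, 7], env_dynamics=[1], actions=[5, 9])
    print(build_target(example))
```

Under this layout an autoregressive decoder has to commit to a short dynamics forecast before it emits any action token, which is the property the abstract attributes to Dynamics CoT. The sketch says nothing about how the tokenizer itself is trained.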