Action Emergence from Streaming Intent

We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than through retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but cannot be steered by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed, through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed, across clips, so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a vision-language-action (VLA) model that we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, which needs only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with a Rater Feedback Score (RFS) of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates intent-faithful controllability, to our knowledge for the first time in a fully end-to-end VLA: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans. This behavior arises purely from data-driven learning, without any pre-built trajectory bank or hand-coded post-hoc selector.
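To make the sampling step concrete, here is a minimal sketch of intent-conditioned classifier-free guidance on a flow-matching velocity field, integrated with two Euler steps. It is an illustration under stated assumptions, not the SI implementation: the abstract does not specify the integrator, the guidance scale, or the intent vocabulary. The intent names, `TOY_TARGETS`, `toy_velocity`, and the flattened-waypoint trajectory format are all hypothetical stand-ins for the learned action head.

```python
from typing import Callable, Dict, List

Trajectory = List[float]  # flattened (x, y) waypoints; an illustrative format
VelocityFn = Callable[[Trajectory, float, str], Trajectory]

# Hypothetical intent classes and the toy trajectory each one points to.
# "null" plays the role of the unconditional (intent-dropped) branch in CFG.
TOY_TARGETS: Dict[str, Trajectory] = {
    "null": [0.0, 0.0, 0.0, 0.0],
    "go_straight": [10.0, 0.0, 20.0, 0.0],
    "turn_left": [8.0, 2.0, 12.0, 8.0],
}


def toy_velocity(x: Trajectory, t: float, intent: str) -> Trajectory:
    """Stand-in for a learned velocity field (assumption, not the real model).

    For a straight-line flow from noise (t=0) to a target (t=1), the velocity
    at state x and time t is (target - x) / (1 - t).
    """
    target = TOY_TARGETS[intent]
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


def guided_velocity(velocity_fn: VelocityFn, x: Trajectory, t: float,
                    intent: str, scale: float,
                    null_intent: str = "null") -> Trajectory:
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = velocity_fn(x, t, intent)
    v_uncond = velocity_fn(x, t, null_intent)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample_trajectory(velocity_fn: VelocityFn, noise: Trajectory, intent: str,
                      scale: float = 1.0, num_steps: int = 2) -> Trajectory:
    """Euler-integrate the guided flow from t=0 to t=1 in num_steps steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # velocity is evaluated at the start of each step
        v = guided_velocity(velocity_fn, x, t, intent, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.3, -1.2, 0.7, 0.1]  # same starting noise for every intent
    for intent in ("go_straight", "turn_left"):
        print(intent, sample_trajectory(toy_velocity, noise, intent))
```

Holding the noise fixed and changing only the intent label yields different trajectories, which is the kind of controllability the abstract describes. In this toy field a scale of 1 recovers the conditional target, and larger scales extrapolate further away from the unconditional one.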
