OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen

From arXiv. Technical Report; 49 pages, 22 figures, 10 tables; Project Page at https://xiaomi-embodied-intelligence.github.io/OneVL

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in Vision-Language-Action (VLA) based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but they consistently fall short of their explicit counterparts. We suggest that this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning.
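
The latency argument in the abstract can be illustrated with a rough count of sequential forward passes. The sketch below is not the authors' code; it assumes a simplified cost model in which one prefill pass handles the whole prompt in parallel (including any prefilled latent tokens) and every autoregressively generated token costs one further sequential pass. The class and function names, and the token counts in the usage lines (200 CoT tokens, 8 latent tokens, 20 answer tokens), are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecodeCost:
    """Simplified cost model (an assumption, not a measurement)."""
    prefill_passes: int  # parallel passes over the prompt and any prefilled tokens
    decode_steps: int    # sequential steps, one per autoregressively generated token

    @property
    def sequential_passes(self) -> int:
        return self.prefill_passes + self.decode_steps


def explicit_cot_cost(num_cot_tokens: int, num_answer_tokens: int) -> DecodeCost:
    # Explicit CoT: the reasoning text and the answer are both generated token by token.
    return DecodeCost(prefill_passes=1, decode_steps=num_cot_tokens + num_answer_tokens)


def answer_only_cost(num_answer_tokens: int) -> DecodeCost:
    # Answer-only: no reasoning tokens, only the answer is decoded.
    return DecodeCost(prefill_passes=1, decode_steps=num_answer_tokens)


def latent_prefill_cost(num_latent_tokens: int, num_answer_tokens: int) -> DecodeCost:
    # Latent tokens are prefilled together with the prompt in the same parallel pass,
    # so under this model their count adds no sequential decode steps.
    if num_latent_tokens < 0:
        raise ValueError("num_latent_tokens must be non-negative")
    return DecodeCost(prefill_passes=1, decode_steps=num_answer_tokens)


# Hypothetical token counts, for illustration only.
explicit = explicit_cot_cost(num_cot_tokens=200, num_answer_tokens=20)
answer_only = answer_only_cost(num_answer_tokens=20)
latent = latent_prefill_cost(num_latent_tokens=8, num_answer_tokens=20)

print(explicit.sequential_passes, answer_only.sequential_passes, latent.sequential_passes)
```

Under this model the latent variant needs the same number of sequential passes as answer-only prediction, which is the sense in which prefilled latents "match" answer-only speed. Real wall-clock latency also depends on prefill length, hardware, and batching, which this sketch ignores.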
