Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning, a limitation inherited from the underlying Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and still falls short of accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% task progress on real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
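The mixture-of-transformers idea described above — each modality (vision-language, depth, action) keeps its own projection weights, while attention is computed jointly over the concatenated token sequence — can be illustrated with a minimal sketch. All names, dimensions, and the single-head formulation here are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

random.seed(0)
D = 8  # illustrative model width


def rand_mat():
    return [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(D)]


def linear(x, W):
    # x: length-D vector, W: D x D matrix -> length-D vector
    return [sum(x[i] * W[i][j] for i in range(D)) for j in range(D)]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


class Expert:
    """Per-modality Q/K/V projections (hypothetical parameterization)."""

    def __init__(self):
        self.Wq, self.Wk, self.Wv = rand_mat(), rand_mat(), rand_mat()


def shared_attention(streams, experts):
    """One shared-attention layer: expert-specific projections,
    joint attention over all tokens from all streams."""
    qs, ks, vs, owner = [], [], [], []
    for s_idx, (tokens, ex) in enumerate(zip(streams, experts)):
        for t in tokens:  # project each token with its own expert's weights
            qs.append(linear(t, ex.Wq))
            ks.append(linear(t, ex.Wk))
            vs.append(linear(t, ex.Wv))
            owner.append(s_idx)
    # Attend jointly over the concatenated sequence (the "fully shared" part).
    scale = 1.0 / math.sqrt(D)
    out = []
    for q in qs:
        scores = [scale * sum(q[d] * k[d] for d in range(D)) for k in ks]
        w = softmax(scores)
        out.append([sum(w[i] * vs[i][d] for i in range(len(vs)))
                    for d in range(D)])
    # Route outputs back to their originating streams.
    result = [[] for _ in streams]
    for o, s_idx in zip(out, owner):
        result[s_idx].append(o)
    return result
```

In this sketch the depth and action tokens can attend to vision-language tokens (and vice versa) in a single attention pass, which is the mechanism by which spatial features from the depth branch would inform action prediction.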