Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that improves spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also having the smallest model size, lowest GPU memory usage, and highest inference frequency among the compared methods.
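
The abstract does not specify how the depth-aware modulation in the Spatial Enhancement Module is computed. As a rough illustration only, the sketch below assumes a FiLM-style scale-and-shift, in which a compact depth feature predicts a per-channel scale and offset that are applied to every vision-language token. The function names, dimensions, and random initialisation are all hypothetical and are not taken from the paper.

```python
import random


def init_linear(in_dim, out_dim, rng, scale=0.1):
    """Hypothetical helper: small random weights and zero bias for y = W x + b."""
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector stored as a plain list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def depth_modulate(vl_tokens, depth_feat, gamma_proj, beta_proj):
    """Assumed FiLM-style modulation: token * (1 + gamma) + beta.

    gamma and beta are predicted from the depth feature, so a zero depth
    feature (with zero bias) leaves the tokens unchanged.
    """
    gamma = linear(depth_feat, *gamma_proj)
    beta = linear(depth_feat, *beta_proj)
    return [[t * (1.0 + g) + b for t, g, b in zip(token, gamma, beta)]
            for token in vl_tokens]


# Toy dimensions chosen for illustration, not the model's real sizes.
rng = random.Random(0)
vl_dim, depth_dim, num_tokens = 8, 4, 3
gamma_proj = init_linear(depth_dim, vl_dim, rng)
beta_proj = init_linear(depth_dim, vl_dim, rng)

vl_tokens = [[rng.gauss(0, 1) for _ in range(vl_dim)] for _ in range(num_tokens)]
depth_feat = [rng.gauss(0, 1) for _ in range(depth_dim)]

out = depth_modulate(vl_tokens, depth_feat, gamma_proj, beta_proj)
identity = depth_modulate(vl_tokens, [0.0] * depth_dim, gamma_proj, beta_proj)
```

Under this assumption the modulation adds only two small projections on top of the vision-language backbone, which is one way such a fusion step could stay lightweight. The `1 + gamma` form means the tokens pass through unchanged when the depth branch outputs zeros.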

