Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios that require precise spatial understanding, because current VLA models rely primarily on 2D visual representations that lack depth information and detailed spatial relationships. Recent approaches address this by incorporating explicit 3D inputs such as depth maps or point clouds, but they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work models 3D-aware spatial structure implicitly from RGB observations without extra sensors, but it often relies on large geometry foundation models, which raises training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that improves spatially grounded manipulation without additional sensing hardware and without compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. A Spatial Enhancement Module then incorporates these features into the vision-language representations via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also having the smallest model size, the lowest GPU memory usage, and the highest inference frequency among the compared methods.
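The abstract does not specify the form of the depth-aware modulation, so the following is only an illustrative sketch under an assumption: a FiLM-style scale-and-shift, in which a compact depth feature produces a per-channel scale `gamma` and shift `beta` that are applied to each vision-language token. The function names (`linear`, `depth_modulate`), the weight layout, and the toy dimensions are all hypothetical and are not taken from Evo-Depth.

```python
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Apply y = W x + b, with `weight` stored as one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def depth_modulate(
    tokens: List[Vector],
    depth_feat: Vector,
    w_gamma: List[Vector],
    b_gamma: Vector,
    w_beta: List[Vector],
    b_beta: Vector,
) -> List[Vector]:
    """FiLM-style modulation (an assumed form, not the paper's exact module).

    Each token t becomes (1 + gamma) * t + beta, where gamma and beta are
    linear functions of the depth feature and are shared across tokens.
    """
    gamma = linear(depth_feat, w_gamma, b_gamma)
    beta = linear(depth_feat, w_beta, b_beta)
    return [[(1.0 + g) * t + b for t, g, b in zip(tok, gamma, beta)] for tok in tokens]


# Toy example: two 3-dim tokens and one 2-dim depth feature (hypothetical sizes).
tokens = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]
depth_feat = [1.0, 2.0]
w_gamma = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.0]]  # gamma = [0.0, 0.5, 0.0]
b_gamma = [0.0, 0.0, 0.0]
w_beta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.5]]   # beta = [0.0, 0.0, 1.0]
b_beta = [0.0, 0.0, 0.0]

modulated = depth_modulate(tokens, depth_feat, w_gamma, b_gamma, w_beta, b_beta)
```

Writing the scale as `1 + gamma` means that zero-initialized projection weights leave the tokens unchanged, so in this sketch the depth branch starts as an identity mapping; whether Evo-Depth uses such an initialization is not stated in the abstract.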