Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches rely primarily on vision-language models (VLMs) trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ VGGT, a depth estimation baseline, to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient use of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module, the action assistant, which constrains the learned 3D representations with action priors and keeps them consistent with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization and robustness of VLA models. Experiments show that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also yields superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
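The fusion of depth-derived 3D tokens with 2D visual tokens, together with the action-prior constraint from the action assistant, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the module names (`DepthFeatureFusion`, `action_consistency_loss`), token dimensions, and the use of cross-attention with a pooled action head are all assumptions, and the depth backbone (e.g. a frozen VGGT) is abstracted away as a pre-computed token stream.

```python
import torch
import torch.nn as nn


class DepthFeatureFusion(nn.Module):
    """Illustrative sketch (assumed design, not the paper's exact module):
    2D visual tokens attend to geometry-aware 3D tokens produced by a frozen
    depth backbone, and a small 'action assistant' head maps the fused
    features to an action prior used as an auxiliary consistency signal."""

    def __init__(self, dim: int = 256, n_heads: int = 4, action_dim: int = 7):
        super().__init__()
        # Cross-attention: 2D tokens as queries, depth-derived 3D tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Hypothetical "action assistant" head producing an action prior.
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, tokens_2d: torch.Tensor, tokens_3d: torch.Tensor):
        # tokens_2d: (B, N, dim) conventional visual tokens
        # tokens_3d: (B, M, dim) depth-derived 3D cues (e.g. from VGGT)
        attended, _ = self.cross_attn(tokens_2d, tokens_3d, tokens_3d)
        fused = self.norm(tokens_2d + attended)            # residual fusion
        action_prior = self.action_head(fused.mean(dim=1))  # pooled prediction
        return fused, action_prior


def action_consistency_loss(action_prior: torch.Tensor,
                            expert_action: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss constraining the 3D representations with action priors
    (assumed here to be a simple regression against expert actions)."""
    return nn.functional.mse_loss(action_prior, expert_action)
```

In this sketch the fused tokens would feed the VLA policy as usual, while `action_consistency_loss` is added as an auxiliary term so the depth-derived features stay aligned with downstream control.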