Robotic manipulation in 3D requires effective computation of N-degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to transform real-world observations into low-level control for object interaction. Recent Vision-Language-Action (VLA) models have shown promise by mapping RGB images and language instructions to task-space velocities, and are typically trained on large datasets of teleoperated demonstrations. However, these models often struggle to generalize beyond their training distributions. In this work, we introduce 3D-CAVLA, a novel finetuning framework that enhances the task generalization of VLA policies by incorporating three key components: (i) chain-of-thought reasoning for structured decision-making, (ii) depth-aware perception for 3D spatial understanding, and (iii) task-oriented region-of-interest detection for focused manipulation. Extensive experiments in the LIBERO simulation environment demonstrate that 3D-CAVLA achieves an average success rate of 98.1% across diverse in-domain task suites. On unseen tasks, 3D-CAVLA delivers an absolute improvement of 8.8% in success rate, underscoring the benefits of 3D scene awareness for robust generalization. We also validate our approach in real-world tabletop experiments, demonstrating that the proposed model transfers effectively from simulation to physical robots. In these experiments, 3D-CAVLA converges over 3x faster in training and delivers a 25% gain in success rate on unseen real-world tasks. We will open-source our code and the unseen-tasks dataset to promote community-driven research here: https://3d-cavla.github.io