Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can 3D information be incorporated to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement the corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io
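The abstract does not spell out how visual input is lifted into a point cloud. As a rough illustration only, the sketch below uses standard pinhole back-projection of a depth map, which is an assumption and not necessarily the procedure used in Any3D-VLA. The function name `unproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `depth_scale` parameter are all hypothetical names introduced for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    depth_scale: float = 1.0,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points (pinhole model).

    depth[v][u] is the depth at pixel column u, row v. Pixels whose scaled
    depth is not positive are treated as invalid and skipped.
    depth_scale is a hypothetical knob for mimicking a depth-scale bias.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            z = d * depth_scale
            if z <= 0:
                continue  # no valid depth at this pixel
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel (assumed values, not real data).
toy_depth = [[1.0, 0.0], [2.0, 2.0]]
cloud = unproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
biased_cloud = unproject_depth(
    toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0, depth_scale=0.5
)
```

Under this model, a uniform depth-scale error rescales every point by the same factor, which gives one intuition for why point clouds from simulators, sensors, and depth estimators can differ in scale even for the same scene.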