GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Yuan Zhang, Shiqi Zhang, Yedong Shen, Shuai Dong, Jiajun Deng, Xin Zhang, Yuxuan Gao, Jiajia Wu, Xin Nie, Zhiyuan Cheng, Jianmin Ji, Yanyong Zhang, Xingyi Zhang, Jia Pan

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.
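As a rough illustration of the embodiment canonicalization idea described in the abstract, the sketch below shows one way robot differences could be confined to a thin low-level interface. It is not the authors' implementation: the names `EmbodimentAdapter`, `CanonicalAction`, `STATE_DIM`, the padded-state layout, and the end-effector-delta action format are all assumptions made for illustration, and the two robots are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

STATE_DIM = 16  # hypothetical width of the shared state vector


@dataclass(frozen=True)
class CanonicalAction:
    """Embodiment-invariant action (assumed layout): end-effector delta plus gripper."""
    delta_xyz: Tuple[float, float, float]
    delta_rpy: Tuple[float, float, float]
    gripper: float  # normalized: 0.0 = fully open, 1.0 = fully closed


@dataclass(frozen=True)
class EmbodimentAdapter:
    """Hypothetical per-robot interface; the policy itself never sees native units."""
    name: str
    embodiment_id: int
    num_joints: int
    gripper_range: Tuple[float, float]  # (open, closed) in the robot's native units

    def encode_state(self, joint_positions: Sequence[float]) -> Dict[str, object]:
        """Embodiment-aware state: zero-padded joints, validity mask, embodiment id."""
        if len(joint_positions) != self.num_joints:
            raise ValueError("unexpected number of joint positions")
        if self.num_joints > STATE_DIM:
            raise ValueError("robot has more joints than the shared state width")
        pad = STATE_DIM - self.num_joints
        values: List[float] = list(joint_positions) + [0.0] * pad
        mask: List[int] = [1] * self.num_joints + [0] * pad
        return {"values": values, "mask": mask, "embodiment_id": self.embodiment_id}

    def decode_action(self, action: CanonicalAction) -> Dict[str, object]:
        """Map a canonical action to a native command; only gripper scaling differs here."""
        lo, hi = self.gripper_range
        return {
            "ee_delta": list(action.delta_xyz) + list(action.delta_rpy),
            "gripper": lo + action.gripper * (hi - lo),
        }


# Two hypothetical robots that share one canonical action.
robot_a = EmbodimentAdapter("robot_a", embodiment_id=0, num_joints=6, gripper_range=(0.0, 0.08))
robot_b = EmbodimentAdapter("robot_b", embodiment_id=1, num_joints=7, gripper_range=(0.0, 100.0))
half_close = CanonicalAction((0.01, 0.0, 0.0), (0.0, 0.0, 0.0), gripper=0.5)
command_a = robot_a.decode_action(half_close)
command_b = robot_b.decode_action(half_close)
```

In this sketch the same `CanonicalAction` yields identical end-effector deltas for both robots, while the gripper command is rescaled per robot. The state keeps an explicit mask and embodiment id so that a shared model could, in principle, tell the embodiments apart.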
