Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, which ties the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, downstream policies reuse only the encoder initialization, so they retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, Sparse2Act achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity indicate that the gains come from the masked action-alignment signal and persist across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.
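To make the key idea concrete, the following is a minimal sketch of a masked observation-action alignment objective. It is an illustration under stated assumptions, not the Sparse2Act implementation: the abstract does not specify the loss, so the sketch assumes an InfoNCE-style contrastive loss, and the mean-pooling encoder, the 0.5 mask ratio, and all function names (`mask_tokens`, `encode_scene`, `alignment_loss`) are hypothetical stand-ins.

```python
import math
import random


def mask_tokens(tokens, mask_ratio, rng):
    """Drop a random subset of tokens; return the visible ones in original order."""
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep]


def linear(weights, x):
    """Apply a dense weight matrix (list of rows) to vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def unit(v):
    """L2-normalize a vector, guarding against a zero norm."""
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


def encode_scene(tokens, weights):
    """Toy stand-in for a sparse 3D encoder: per-token linear map, then mean pool."""
    feats = [linear(weights, t) for t in tokens]
    pooled = [sum(col) / len(feats) for col in zip(*feats)]
    return unit(pooled)


def alignment_loss(scene_embs, action_embs, temperature=0.1):
    """Assumed InfoNCE loss: each scene should score highest with its own action."""
    total = 0.0
    for i, s in enumerate(scene_embs):
        logits = [sum(a * b for a, b in zip(s, act)) / temperature
                  for act in action_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(scene_embs)


rng = random.Random(0)
# Two toy scenes: point tokens (x, y, z) lying along +x or +y.
scenes = [
    [(1.0 + 0.01 * k, 0.0, 0.0) for k in range(10)],
    [(0.0, 1.0 + 0.01 * k, 0.0) for k in range(10)],
]
# Paired task-space end-effector displacements (dx, dy, dz); toy values.
actions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

visible = [mask_tokens(s, mask_ratio=0.5, rng=rng) for s in scenes]
scene_embs = [encode_scene(v, identity) for v in visible]
action_embs = [unit(linear(identity, a)) for a in actions]

aligned = alignment_loss(scene_embs, action_embs)
shuffled = alignment_loss(scene_embs, action_embs[::-1])
print(f"aligned={aligned:.4f} shuffled={shuffled:.4f}")
```

In this toy setup the loss is low when each scene is paired with its own action and high when the pairing is swapped, which is the signal a pretraining step would minimize. Only the encoder weights (here `identity`) would carry over to a downstream policy; the action branch is discarded, which is why the downstream action space is free to differ.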
