3D perceptual representations are well suited for robot manipulation because they encode occlusions naturally and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, forgoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot's workspace using a 3D feature field whose resolution adapts to the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state of the art on RLBench, an established manipulation benchmark, where it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. In ablative experiments, we quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions. Code and videos are available on our project website: https://act3d.github.io/.
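The two geometric ideas in the abstract, lifting pixels to 3D with sensed depth and sampling 3D point grids from coarse to fine, can be illustrated with a minimal NumPy sketch. This is not the Act3D implementation: the function names (`unproject_depth`, `sample_grid`, `coarse_to_fine_search`), the pinhole camera model, the grid size, and the shrink rule are all assumptions made for illustration, and `score_fn` is a hypothetical stand-in for the learned scorer that the paper builds from relative-position attention.

```python
import numpy as np

def unproject_depth(depth, intrinsics):
    """Lift each pixel of an (H, W) depth map to a 3D point in the camera frame.

    Assumes a pinhole camera with a 3x3 intrinsics matrix (an assumption of
    this sketch, not a detail taken from the paper).
    """
    h, w = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)  # (H*W, 3)

def sample_grid(center, half_extent, n):
    """Return n**3 points on a regular grid in a cube around `center`."""
    axis = np.linspace(-half_extent, half_extent, n)
    offsets = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return center + offsets.reshape(-1, 3)

def coarse_to_fine_search(score_fn, center, half_extent, n=8, levels=4):
    """Repeatedly sample a grid, keep the best-scoring point, and zoom in.

    `score_fn` maps an (N, 3) array of points to N scores. Here it is a
    hypothetical placeholder for a learned scoring model.
    """
    center = np.asarray(center, dtype=float)
    for _ in range(levels):
        points = sample_grid(center, half_extent, n)
        scores = score_fn(points)
        center = points[np.argmax(scores)]
        # The next cube spans one grid cell on each side of the best point.
        half_extent = 2.0 * half_extent / (n - 1)
    return center

if __name__ == "__main__":
    # Toy scorer: points closer to a hidden target position score higher.
    target = np.array([0.31, -0.12, 0.47])
    toy_score = lambda pts: -np.linalg.norm(pts - target, axis=1)
    estimate = coarse_to_fine_search(toy_score, center=[0.0, 0.0, 0.0], half_extent=1.0)
    print("estimate:", estimate, "error:", np.linalg.norm(estimate - target))
```

With these settings each level scores only 8³ = 512 points, so four levels evaluate 2,048 points in total, whereas a single dense grid with the same final spacing over the whole cube would need millions. This is the efficiency argument the abstract makes for adaptive resolution.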