3D perceptual representations are well suited for robot manipulation because they naturally encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D perceptual grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, forgoing 3D inductive biases. In this paper, we propose Act3D, a manipulation policy Transformer that casts 6-DoF keypose prediction as 3D detection with adaptive spatial computation. It takes as input 3D feature clouds unprojected from one or more camera views, iteratively samples 3D point grids in free space in a coarse-to-fine manner, featurizes them using relative spatial attention to the physical feature cloud, and selects the best feature point for end-effector pose prediction. Act3D sets a new state of the art on RLBench, an established manipulation benchmark. Our model achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. In thorough ablations, we show the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions. Code and videos are available at our project site: https://act3d.github.io/.
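The coarse-to-fine sampling idea can be illustrated with a minimal sketch. This is not the Act3D implementation: the function names (`sample_grid`, `coarse_to_fine_select`, `toy_score`), the grid size, the number of stages, and the shrink factor are all assumptions chosen for illustration, and the learned relative spatial attention that scores candidate points is replaced here by a hypothetical stand-in that simply measures distance to a known target. The sketch only shows how repeatedly sampling a small grid around the current best point gives fine spatial resolution without evaluating a dense high-resolution grid.

```python
import numpy as np


def sample_grid(center, half_extent, n_per_axis):
    """Return an (n_per_axis**3, 3) array of points on a regular cubic grid around `center`."""
    axis = np.linspace(-half_extent, half_extent, n_per_axis)
    offsets = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return center + offsets.reshape(-1, 3)


def coarse_to_fine_select(score_fn, center, half_extent,
                          n_per_axis=5, n_stages=3, shrink=0.25):
    """Repeatedly sample a grid, keep the best-scoring point, and zoom in around it.

    `score_fn` maps an (N, 3) array of points to N scores. In this sketch it is a
    stand-in for a learned scorer; all default parameters are illustrative.
    """
    center = np.asarray(center, dtype=float)
    for _ in range(n_stages):
        points = sample_grid(center, half_extent, n_per_axis)
        scores = score_fn(points)
        center = points[np.argmax(scores)]  # best candidate at this resolution
        half_extent *= shrink               # next grid covers a smaller region
    return center


# Hypothetical target position, used only to build a toy scoring function.
target = np.array([0.31, -0.12, 0.47])


def toy_score(points):
    """Stand-in scorer: points closer to `target` score higher."""
    return -np.linalg.norm(points - target, axis=-1)


estimate = coarse_to_fine_select(toy_score, center=[0.0, 0.0, 0.0], half_extent=1.0)
```

With these illustrative settings, three stages of 125 points each (375 score evaluations in total) reach a grid spacing of about 0.03 in a workspace of side length 2. A single dense grid at the same spacing would need 65 points per axis, or roughly 275,000 points.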