We propose OctFormer, an octree-based transformer for 3D point cloud learning. OctFormer serves as a general and effective backbone for 3D point cloud segmentation and object detection, has linear complexity, and scales to large point clouds. The key challenge in applying transformers to point clouds is reducing the quadratic, and thus overwhelming, computational complexity of attention. To address this, several works divide point clouds into non-overlapping windows and constrain attention to each local window. However, the number of points per window varies greatly, which impedes efficient execution on GPUs. Observing that attention is robust to the shapes of local windows, we propose a novel octree attention, which uses the sorted shuffled keys of octrees to partition point clouds into local windows that each contain a fixed number of points, while allowing window shapes to vary freely. We also introduce dilated octree attention to further expand the receptive field. Our octree attention can be implemented in 10 lines of code with open-source libraries and runs 17 times faster than other point cloud attentions when the number of points exceeds 200k. Built upon octree attention, OctFormer scales up easily and achieves state-of-the-art performance on a series of 3D segmentation and detection benchmarks, surpassing previous sparse-voxel-based CNNs and point cloud transformers in both efficiency and effectiveness. Notably, on the challenging ScanNet200 dataset, OctFormer outperforms sparse-voxel-based CNNs by 7.3 in mIoU. Our code and trained models are available at https://wang-ps.github.io/octformer.
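The window partition described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `shuffled_key` and `octree_windows`, the use of `-1` as a padding index, and the assumption that points are already quantized to non-negative integer voxel coordinates are all illustrative choices. The sketch assumes that a shuffled key is a bit-interleaved (z-order) code, sorts points by that key, and cuts the sorted sequence into windows of exactly `window_size` points. The dilated variant is assumed to gather every `dilation`-th point within a block of `window_size * dilation` consecutive points.

```python
def shuffled_key(x, y, z, depth):
    """Interleave the low `depth` bits of integer (x, y, z) into one z-order key."""
    key = 0
    for i in range(depth):
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i)
    return key


def octree_windows(coords, window_size, dilation=1, depth=8):
    """Group point indices into windows of exactly `window_size` entries.

    coords: list of (x, y, z) non-negative integer voxel coordinates.
    Returns a list of windows; each window is a list of point indices,
    with -1 marking padding slots (an assumption of this sketch).
    """
    # Sort point indices by shuffled key so that neighbours in the
    # sorted sequence tend to be neighbours in space.
    order = sorted(range(len(coords)),
                   key=lambda i: shuffled_key(*coords[i], depth))

    # Pad so the sequence splits evenly into blocks of window_size * dilation.
    group = window_size * dilation
    order = order + [-1] * ((-len(order)) % group)

    windows = []
    for start in range(0, len(order), group):
        block = order[start:start + group]
        # With dilation D, window d takes entries d, d + D, d + 2D, ...
        # With D == 1 this is simply the contiguous block.
        for d in range(dilation):
            windows.append(block[d::dilation])
    return windows


# Example: the eight corners of a 2x2x2 voxel grid.
coords = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
plain = octree_windows(coords, window_size=4)
dilated = octree_windows(coords, window_size=4, dilation=2)
```

Because every window holds the same number of entries, the windows can be stacked into one dense batch and attention can be computed per window with a padding mask, which is what makes the fixed-size partition GPU-friendly. The cost grows linearly with the number of points, since each point attends to at most `window_size` others.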