A key challenge for LiDAR-based 3D object detection is capturing sufficient features from large-scale 3D scenes, especially for distant and/or occluded objects. Despite recent efforts with Transformers, which offer long-sequence modeling capability, existing methods fail to properly balance accuracy and efficiency, suffering from either inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid by conducting self-attention on the top level and then recursively propagating to the level below, restricted by the octants. This captures rich global context in a coarse-to-fine manner while keeping computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of a semantic-aware positional embedding and an attention mask, to fully exploit semantic and geometric clues. Extensive experiments are conducted on the Waymo Open Dataset and the KITTI dataset, and OcTr achieves new state-of-the-art results.
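The coarse-to-fine attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the average-pooled pyramid, the single query vector, the `top_k` octant selection rule, and every function name below are assumptions made for illustration only.

```python
import math
from itertools import product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def children(cell):
    """Return the eight octants of a cell at the next finer level."""
    x, y, z = cell
    return [(2 * x + dx, 2 * y + dy, 2 * z + dz)
            for dx, dy, dz in product((0, 1), repeat=3)]


def build_pyramid(finest, size, levels):
    """Average-pool a dense size^3 grid into a pyramid (assumed pooling rule).

    Returns a list of {cell: feature} dicts ordered from coarsest to finest.
    """
    pyramid = [finest]
    for _ in range(levels - 1):
        size //= 2
        fine = pyramid[0]
        coarse = {}
        for cell in product(range(size), repeat=3):
            kids = [fine[c] for c in children(cell)]
            coarse[cell] = [sum(v) / 8.0 for v in zip(*kids)]
        pyramid.insert(0, coarse)
    return pyramid


def octree_attention(query, pyramid, top_k):
    """Coarse-to-fine attention for one query (hypothetical top_k selection).

    Scores every cell on the top level, keeps the top_k best cells, and
    restricts the next level to the octants of those cells.
    """
    candidates = sorted(pyramid[0])
    scale = math.sqrt(len(query))
    visited = 0
    for level, feats in enumerate(pyramid):
        scores = [dot(query, feats[c]) / scale for c in candidates]
        visited += len(candidates)
        if level == len(pyramid) - 1:
            # Finest level: softmax-weighted sum over the surviving cells.
            weights = softmax(scores)
            out = [sum(w * feats[c][i] for w, c in zip(weights, candidates))
                   for i in range(len(query))]
            return out, candidates, visited
        ranked = sorted(zip(scores, candidates), reverse=True)[:top_k]
        candidates = [kid for _, cell in ranked for kid in children(cell)]


size, levels, dim = 8, 3, 4
finest = {cell: [math.sin(i + sum(cell)) for i in range(dim)]
          for cell in product(range(size), repeat=3)}
pyramid = build_pyramid(finest, size, levels)
out, kept, visited = octree_attention([1.0, 0.0, 0.5, -0.5], pyramid, top_k=2)
print(len(kept), visited, "of", len(finest))
```

With these toy settings the query scores 40 cells across the three levels instead of all 512 cells on the finest level, which is the kind of saving the octant restriction is meant to provide.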