DEtection TRansformer (DETR) started a trend of using a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, its instance segmentation performance is noticeably inferior to that of previous works. Looking into the details, we observe that instances in sparse point clouds are small relative to the whole scene and often have similar geometry but lack distinctive appearance for segmentation, a combination that is rare in the image domain. Since instances in 3D are characterized more by their positional information, we emphasize its role during modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. MPE is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at https://github.com/SmartBot-PJLab/P3Former.
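For readers unfamiliar with the PQ (panoptic quality) metric in which the gains are reported, the following is a minimal sketch of its standard per-class definition: the summed IoU of matched segments divided by |TP| + 0.5|FP| + 0.5|FN|, which factors into segmentation quality (SQ) times recognition quality (RQ). The function name `panoptic_quality` and its arguments are illustrative assumptions and are not taken from the P3Former code base. Benchmark evaluation additionally handles segment matching and averaging over classes, which this sketch omits.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for a single class.

    tp_ious: IoU of each matched (true positive) segment pair; under the
             standard definition a match requires IoU > 0.5.
    num_fp:  number of unmatched predicted segments.
    num_fn:  number of unmatched ground-truth segments.
    """
    if any(iou <= 0.5 for iou in tp_ious):
        raise ValueError("a true positive match requires IoU > 0.5")

    num_tp = len(tp_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        # The class is absent from both prediction and ground truth.
        return 0.0, 0.0, 0.0

    # SQ: mean IoU over matched segments.
    sq = sum(tp_ious) / num_tp if num_tp else 0.0
    # RQ: an F1-style detection score over segments.
    rq = num_tp / denom
    return sq * rq, sq, rq


if __name__ == "__main__":
    pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
    print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
```

Because PQ multiplies SQ by RQ, it penalizes both imprecise masks and missed or spurious instances, so a method that separates small, geometrically similar instances more reliably can improve PQ through either factor.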