Recent self-supervised clustering-based pre-training techniques such as DINO and CrIBo have shown impressive results on downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges from imbalanced object class and size distributions and from complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering scheme that provides more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose depth-guided spatial clustering to regularize learning based on the geometric information of the scene, further refining region separation at the feature level. Our learned representations significantly improve performance on downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets, and show promising domain transfer properties.
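To make the third contribution concrete, the sketch below illustrates one plausible reading of depth-guided spatial clustering: appending a scaled depth channel to patch features before clustering, so that regions at similar depths are more likely to be grouped together. This is a minimal toy with plain k-means, not the paper's actual method; the function name, the `depth_weight` parameter, and the use of k-means are all assumptions for illustration.

```python
import numpy as np

def depth_guided_clustering(features, depth, k=4, depth_weight=0.5, iters=10, seed=0):
    """Toy k-means over patch features augmented with a normalized depth channel.

    features: (N, D) array of per-patch embeddings.
    depth:    (N,) array of per-patch depth estimates.

    Scaling the depth channel by `depth_weight` (an assumed knob, not from the
    paper) biases cluster assignments toward depth-consistent regions -- a
    simplified stand-in for geometry-guided regularization.
    """
    # Normalize depth so its scale is comparable to the feature dimensions.
    d = (depth - depth.mean()) / (depth.std() + 1e-8)
    x = np.concatenate([features, depth_weight * d[:, None]], axis=1)

    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center: (N, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        # Update each center to the mean of its assigned points.
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign
```

In the actual pre-training pipeline, such assignments would serve as pseudo-labels in a clustering objective rather than be used directly; this snippet only shows how depth can steer the grouping.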