Recent self-supervised clustering-based pre-training techniques such as DINO and CrIBo have shown impressive results on downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering approach that provides more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose depth-guided spatial clustering to regularize learning based on the geometric information of the scene, further refining region separation at the feature level. Our learned representations significantly improve performance on downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets, and show promising domain transfer properties.
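To make the third contribution more concrete, the following is a minimal sketch of what depth-guided spatial clustering could look like: k-means over patch embeddings augmented with a normalized depth channel, so that geometrically coherent regions tend to fall into the same cluster. This is an illustrative assumption, not the paper's actual objective; the function name, the `depth_weight` parameter, and the plain k-means formulation are all hypothetical.

```python
import numpy as np

def depth_guided_clustering(features, depth, k=4, depth_weight=0.5, iters=10, seed=0):
    """Hypothetical sketch: k-means over patch features concatenated with a
    scaled depth channel. `features`: (N, D) patch embeddings; `depth`: (N,)
    per-patch depth estimates. Returns an (N,) array of cluster assignments."""
    rng = np.random.default_rng(seed)
    # Normalize both modalities so the depth term is comparable in scale.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    d = (depth - depth.mean()) / (depth.std() + 1e-8)
    x = np.concatenate([f, depth_weight * d[:, None]], axis=1)
    # Initialize centers from random patches, then run Lloyd's iterations.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(axis=0)
    return assign

# Toy usage: 64 patches with 16-dim features and random depths.
rng = np.random.default_rng(1)
labels = depth_guided_clustering(rng.normal(size=(64, 16)), rng.uniform(1.0, 50.0, 64))
```

With `depth_weight=0`, this reduces to purely feature-space clustering; increasing it biases assignments toward depth-consistent regions, which is the regularizing effect the abstract attributes to scene geometry.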