Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MOCOv3) primarily target instance-level representations and do not generalize well to dense prediction tasks such as object detection and segmentation. Towards aligning SSL with dense prediction, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuff). By employing a transformer for joint embedding and clustering, we propose a two-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)% AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbones, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV17 object detection on UAVDT and video instance segmentation on DAVIS 2017. We conclude by presenting visualizations and various ablation studies to better understand the success of FLSL. The source code is available at https://github.com/ISL-CV/FLSL.