We propose ADCLR: Accurate and Dense Contrastive Representation Learning, a novel self-supervised learning framework for learning accurate and dense vision representations. To extract spatially sensitive information, ADCLR introduces query patches for contrasting in addition to global contrasting. Compared with previous dense contrasting methods, ADCLR has three main merits: i) it achieves both globally discriminative and spatially sensitive representations, ii) it is model-efficient (no extra parameters beyond the global contrasting baseline), and iii) it is correspondence-free and thus simpler to implement. Our approach achieves new state-of-the-art performance among contrastive methods. On classification tasks, with ViT-S, ADCLR achieves 77.5% top-1 accuracy on ImageNet with linear probing, outperforming our baseline (DINO, i.e., without our techniques added as a plug-in) by 0.5%. With ViT-B, ADCLR achieves 79.8% and 84.0% accuracy on ImageNet with linear probing and fine-tuning, outperforming iBOT by 0.3% and 0.2%, respectively. On dense tasks, ADCLR achieves 44.3% AP on object detection and 39.7% AP on instance segmentation on MS-COCO, outperforming the previous SOTA method SelfPatch by 2.2% and 1.2%, respectively. On ADE20K, ADCLR outperforms SelfPatch by 1.0% mIoU and 1.2% mAcc on the semantic segmentation task.
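The abstract gives ADCLR's scores together with its margins over each baseline, so the baseline scores it implies can be recovered by subtraction. The short sketch below does that arithmetic. It assumes each margin is an absolute difference in percentage points, which the abstract does not state explicitly; the resulting baseline values are derived here, not reported in the abstract. The names `reported` and `implied_baselines` are illustrative only. The ADE20K margins are omitted because the abstract gives no absolute ADCLR scores for that benchmark.

```python
# Scores and margins exactly as stated in the abstract.
# Assumption: every margin is an absolute gap in percentage points.
reported = [
    # (setting, baseline name, ADCLR score, margin over baseline)
    ("ViT-S ImageNet linear probing", "DINO", 77.5, 0.5),
    ("ViT-B ImageNet linear probing", "iBOT", 79.8, 0.3),
    ("ViT-B ImageNet fine-tuning", "iBOT", 84.0, 0.2),
    ("MS-COCO object detection AP", "SelfPatch", 44.3, 2.2),
    ("MS-COCO instance segmentation AP", "SelfPatch", 39.7, 1.2),
]


def implied_baselines(rows):
    """Map (setting, baseline) to score minus margin, rounded to one decimal."""
    return {
        (setting, baseline): round(score - margin, 1)
        for setting, baseline, score, margin in rows
    }


if __name__ == "__main__":
    for (setting, baseline), value in implied_baselines(reported).items():
        print(f"{setting}: implied {baseline} score = {value}")
```

Rounding to one decimal keeps the derived values at the same precision as the reported figures and avoids floating-point noise in the subtraction.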