Recent advances in artificial intelligence have been attributed in large part to self-supervised learning (SSL). Despite its impressive achievements in NLP, SSL in computer vision has not kept pace. More recently, adding contrastive learning to existing SSL models has brought considerable progress in computer vision, to the point that visual SSL models have outperformed their supervised counterparts. Nevertheless, most of these improvements have been limited to classification tasks, and few works have evaluated SSL models in real-world computer vision scenarios; the majority focus on datasets of class-wise portrait images, most notably ImageNet. In this work, we therefore consider the dense prediction task of semantic segmentation in security inspection X-ray images to evaluate our proposed model, Segmentation Localization (SegLoc). Building on the Instance Localization (InsLoc) model, SegLoc addresses one of the most challenging drawbacks of contrastive learning, namely false negative pairs of query embeddings. To do so, and in contrast to the InsLoc baseline, we synthesize our pretraining dataset by cropping, transforming, and then pasting already labeled segments from an available labeled dataset (foregrounds) onto instances of an unlabeled dataset (backgrounds). In our case, PIDray and SIXray serve as the labeled and unlabeled datasets, respectively. Moreover, we fully exploit the labels to avoid false negative pairs by implementing a one-queue-per-class scheme in MoCo-v2, whereby the negative pairs for each query are drawn from its corresponding queue within the memory bank. Our approach outperforms random initialization by 3% to 6%, while underperforming supervised initialization.
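The one-queue-per-class idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes one possible reading of the scheme: the queue for class c stores only keys of other classes, so a query of class c can take every entry of its own queue as a negative without meeting a same-class key. The names `PerClassQueues` and `info_nce`, the queue size, the embedding dimension, and the temperature value are all illustrative assumptions.

```python
import numpy as np


class PerClassQueues:
    """Hypothetical memory bank: one FIFO queue of negative keys per class."""

    def __init__(self, num_classes, dim, queue_size, seed=0):
        rng = np.random.default_rng(seed)
        # queues[c] holds keys assumed to be safe negatives for class-c queries.
        keys = rng.standard_normal((num_classes, queue_size, dim))
        self.queues = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
        self.ptr = np.zeros(num_classes, dtype=int)  # next write slot per queue
        self.queue_size = queue_size

    def enqueue(self, key, label):
        """Store a key of class `label` in every queue except its own."""
        for c in range(len(self.queues)):
            if c == label:
                continue  # a same-class key would be a false negative
            self.queues[c, self.ptr[c]] = key
            self.ptr[c] = (self.ptr[c] + 1) % self.queue_size

    def negatives(self, label):
        """Return the negatives for a query whose class is `label`."""
        return self.queues[label]


def info_nce(query, pos_key, neg_keys, temperature=0.2):
    """InfoNCE loss for one query; the positive logit comes first."""
    logits = np.concatenate(([query @ pos_key], neg_keys @ query)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))


# Usage with toy embeddings (random vectors stand in for encoder outputs).
bank = PerClassQueues(num_classes=3, dim=8, queue_size=4)
rng = np.random.default_rng(1)
q = rng.standard_normal(8)
q /= np.linalg.norm(q)
k = rng.standard_normal(8)
k /= np.linalg.norm(k)
loss = info_nce(q, k, bank.negatives(label=1))
bank.enqueue(k, label=1)
```

Under this assumed layout each key is copied into every queue but its own, which trades memory for a simple lookup at loss time. Storing each key once in its own class queue and gathering negatives from all other queues would be an equivalent alternative.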