This paper presents Contrastive Transformer, a contrastive learning scheme that exploits the patches inherent to Transformer architectures. Contrastive Transformer enables existing contrastive learning techniques, which are mostly used for image classification, to benefit dense downstream prediction tasks such as semantic segmentation. The scheme performs supervised patch-level contrastive learning: patches are selected according to the ground-truth mask and then used for hard-negative and hard-positive sampling. The scheme applies to all vision-transformer architectures, is easy to implement, and adds minimal memory overhead. It also removes the need for huge batch sizes, because each patch is treated as an image. We apply and test Contrastive Transformer on aerial image segmentation, a task known for low-resolution data, large class imbalance, and semantically similar classes. Extensive experiments on the ISPRS Potsdam aerial image segmentation dataset demonstrate the efficacy of the scheme. We further show that it generalizes by applying it to multiple inherently different Transformer architectures. The results show a consistent increase in mean IoU across all classes.
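To make the patch-level idea concrete, the following is a minimal NumPy sketch and not the implementation evaluated in this paper. It assumes that each patch takes the majority class of its ground-truth pixels, that the loss is an InfoNCE-style supervised contrastive loss over patch embeddings, and that hard positives and hard negatives are the least similar same-class and most similar different-class patches, respectively. The function names `patch_labels` and `supervised_patch_contrastive_loss`, the `temperature` value, and the `k_hard` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def patch_labels(mask, patch):
    """Assign each patch the majority class of its ground-truth pixels (assumed rule)."""
    mask = np.asarray(mask)
    gh, gw = mask.shape[0] // patch, mask.shape[1] // patch
    # Cut the mask into non-overlapping patch x patch blocks, one row per patch.
    blocks = (
        mask[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gh * gw, -1)
    )
    return np.array([np.bincount(b).argmax() for b in blocks])


def supervised_patch_contrastive_loss(emb, labels, temperature=0.1, k_hard=None):
    """InfoNCE-style supervised contrastive loss over patch embeddings (illustrative)."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    # Cosine similarity between all pairs of patches, scaled by the temperature.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / temperature
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)

    losses = []
    for i in range(len(labels)):
        pos = sim[i, same[i] & not_self[i]]  # same class, excluding the anchor
        neg = sim[i, ~same[i]]               # different class
        if pos.size == 0 or neg.size == 0:
            continue  # an anchor needs at least one positive and one negative
        if k_hard is not None:
            pos = np.sort(pos)[:k_hard]   # hard positives: least similar same-class
            neg = np.sort(neg)[-k_hard:]  # hard negatives: most similar other-class
        # -log( exp(p) / (exp(p) + sum_n exp(n)) ), averaged over the positives.
        log_neg_sum = np.logaddexp.reduce(neg)
        losses.append(np.mean(np.logaddexp(pos, log_neg_sum) - pos))
    return float(np.mean(losses)) if losses else 0.0


# Toy usage: a 4x4 ground-truth mask split into four 2x2 patches.
mask = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 1, 1],
                 [2, 0, 1, 1]])
labels = patch_labels(mask, patch=2)
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 1.0]])
loss = supervised_patch_contrastive_loss(embeddings, labels)
```

In this sketch every patch in a batch serves as an anchor, so a single image already contributes many positive and negative pairs. This is one way to read the claim that treating each patch as an image removes the need for huge batch sizes.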