SemiCD-VL: Visual-Language Model Guidance Makes Better Semi-supervised Change Detector

Change detection (CD) aims to identify pixels that have undergone semantic change between images. However, producing pixel-level annotations for large image collections is labor-intensive and costly, especially for multi-temporal images, which require pixel-wise comparison by human experts. Given the strong performance of visual-language models (VLMs) on zero-shot and open-vocabulary tasks with prompt-based reasoning, using VLMs to improve CD under limited labeled data is promising. In this paper, we propose a VLM guidance-based semi-supervised CD method, namely SemiCD-VL. The insight of SemiCD-VL is to synthesize free change labels using VLMs to provide additional supervision signals for unlabeled data. However, almost all current VLMs are designed for single-temporal images and cannot be applied directly to bi- or multi-temporal images. Motivated by this, we first propose a VLM-based mixed change event generation (CEG) strategy to yield pseudo labels for unlabeled CD data. Since the additional supervision signals from these VLM-driven pseudo labels may conflict with the pseudo labels from the consistency regularization paradigm (e.g., FixMatch), we propose a dual projection head to disentangle the different signal sources. Further, we explicitly decouple the semantic representations of the bi-temporal images through two auxiliary segmentation decoders, which are also guided by the VLM. Finally, so that the model captures change representations more adequately, we introduce metric-aware supervision via a feature-level contrastive loss in the auxiliary branches. Extensive experiments show the advantage of SemiCD-VL. For instance, SemiCD-VL improves the FixMatch baseline by +5.3 IoU on WHU-CD and by +2.4 IoU on LEVIR-CD with 5% labels. In addition, our CEG strategy, used in an unsupervised manner, achieves performance far superior to state-of-the-art unsupervised CD methods.
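To make the idea of deriving change labels from single-temporal predictions concrete, the following is a minimal sketch and not the CEG strategy itself, whose details are not given here. It assumes that a per-pixel class map is already available for each of the two images (for example, from a VLM-based segmenter) and marks a pixel as changed when its class differs between the two dates. It also computes IoU, assumed here to be measured on the change class. The names `change_mask_from_semantics` and `change_iou` and the toy class maps are hypothetical.

```python
from typing import List

Mask = List[List[int]]


def change_mask_from_semantics(seg_a: Mask, seg_b: Mask) -> Mask:
    """Return a binary mask: 1 where the per-pixel class ids differ, else 0.

    Simplifying assumption: any class difference counts as a change.
    """
    if len(seg_a) != len(seg_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(seg_a, seg_b)
    ):
        raise ValueError("both class maps must have the same shape")
    return [
        [int(a != b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(seg_a, seg_b)
    ]


def change_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of the change class (label 1)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += int(p == 1 and t == 1)
            union += int(p == 1 or t == 1)
    # Convention chosen for this sketch: two empty masks agree perfectly.
    return inter / union if union else 1.0


# Toy class maps for the two dates (class ids are arbitrary placeholders).
seg_t1 = [[0, 0, 1],
          [0, 2, 2]]
seg_t2 = [[0, 1, 1],
          [0, 2, 0]]

pseudo_label = change_mask_from_semantics(seg_t1, seg_t2)
reference = [[0, 1, 0],
             [0, 1, 1]]
score = change_iou(pseudo_label, reference)
```

In this toy case the pseudo label marks two changed pixels, both of which appear among the three changed pixels of the reference, so the change-class IoU is 2/3.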
