LaVIDE: A Language-Vision Discriminator for Detecting Changes in Satellite Image with Map References

Change detection, which typically relies on the comparison of bi-temporal images, is significantly hindered when only a single image is available. Comparing a single image with an existing map, such as OpenStreetMap, which is continuously updated through crowd-sourcing, offers a viable solution to this challenge. Unlike images that carry low-level visual details of ground objects, maps convey high-level categorical information. This discrepancy in abstraction levels complicates the alignment and comparison of the two data types. In this paper, we propose a \textbf{La}nguage-\textbf{VI}sion \textbf{D}iscriminator for d\textbf{E}tecting changes in satellite image with map references, namely \ours{}, which leverages language to bridge the information gap between maps and images. Specifically, \ours{} formulates change detection as the problem of ``{\textit Does the pixel belong to [class]?}'', aligning maps and images within the feature space of the language-vision model to associate high-level map categories with low-level image details. Moreover, we build a mixture-of-experts discriminative module, which compares linguistic features from maps with visual features from images across various semantic perspectives, achieving comprehensive semantic comparison for change detection. Extensive evaluation on four benchmark datasets demonstrates that \ours{} can effectively detect changes in satellite image with map references, outperforming state-of-the-art change detection algorithms, e.g., with gains of about $13.8$\% on the DynamicEarthNet dataset and $4.3$\% on the SECOND dataset.

翻译：变化检测通常依赖于双时相影像的对比，当仅能获取单幅影像时，该任务会受到显著制约。将单幅影像与现有地图（如通过众包持续更新的OpenStreetMap）进行对比，为此挑战提供了一个可行的解决方案。与承载地物底层视觉细节的影像不同，地图传达的是高层级的类别信息。这种抽象层次的差异使得两种数据类型的对齐与比较变得复杂。本文提出一种用于检测带地图参考的卫星影像变化的**La**nguage-**VI**sion **D**iscriminator（语言-视觉判别器），即\ours{}，其利用语言来弥合地图与影像之间的信息鸿沟。具体而言，\ours{}将变化检测形式化为“{\textit 该像素是否属于[类别]？}”的问题，在地图与影像的特征空间内，借助语言-视觉模型将高层级的地图类别与低层级的影像细节关联起来。此外，我们构建了一个专家混合判别模块，该模块从多个语义视角对比地图的语言特征与影像的视觉特征，实现了面向变化检测的全面语义比较。在四个基准数据集上的广泛评估表明，\ours{}能够有效检测带地图参考的卫星影像中的变化，其性能优于当前最先进的变化检测算法，例如在DynamicEarthNet数据集上提升约$13.8$\%，在SECOND数据集上提升约$4.3$\%。