We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundation model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal changes, and viewpoint differences. To effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) ``freeze'' the backbone to retain the generality of dense foundation features, and b) employ ``full-image'' cross-attention to better handle viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate that our method generalizes better than existing state-of-the-art approaches, remaining robust to photometric and geometric variations and adapting well when fine-tuned on new environments. Detailed ablation studies further validate the contributions of each component in our architecture. Source code will be made publicly available upon acceptance.
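The two design choices in the abstract, a frozen foundation backbone and full-image cross-attention between the image pair, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names (`FullImageCrossAttention`, `ChangeDetector`), the token dimension, and the per-token linear change head are illustrative assumptions; the backbone is stood in for by any module that yields dense patch tokens.

```python
import torch
import torch.nn as nn

class FullImageCrossAttention(nn.Module):
    """Every patch token of one image attends to ALL patch tokens of the
    other image, rather than only to the spatially aligned location. This
    is what lets the model cope with viewpoint shifts between the pair."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_q: torch.Tensor, tokens_kv: torch.Tensor) -> torch.Tensor:
        # Queries come from one image; keys/values from the other.
        out, _ = self.attn(query=tokens_q, key=tokens_kv, value=tokens_kv)
        return self.norm(tokens_q + out)  # residual + norm

class ChangeDetector(nn.Module):
    """Hypothetical wrapper: frozen dense-feature backbone + symmetric
    full-image cross-attention + a per-token change head."""
    def __init__(self, backbone: nn.Module, dim: int = 384):
        super().__init__()
        self.backbone = backbone
        # a) "Freeze" the backbone to retain general dense foundation features.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # b) Cross-attend in both directions between the image pair.
        self.cross_ab = FullImageCrossAttention(dim)
        self.cross_ba = FullImageCrossAttention(dim)
        self.head = nn.Linear(2 * dim, 1)  # per-token change logit

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # backbone stays fixed during training
            feat_a = self.backbone(img_a)  # (B, N, dim) patch tokens
            feat_b = self.backbone(img_b)
        a2b = self.cross_ab(feat_a, feat_b)
        b2a = self.cross_ba(feat_b, feat_a)
        return self.head(torch.cat([a2b, b2a], dim=-1))  # (B, N, 1)
```

In a real setting the backbone would be a DINOv2 vision transformer whose patch-token outputs have the matching embedding dimension; here only the frozen-backbone and full-image attention pattern is demonstrated.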