In this paper, we present Change3D, a framework that reconceptualizes change detection and change captioning through video modeling. Recent methods have achieved remarkable success by treating each pair of bi-temporal images as separate frames: a shared-weight image encoder extracts spatial features, and a change extractor then captures the differences between the two images. However, image feature encoding is a task-agnostic process and cannot attend effectively to changed regions. Furthermore, the diverse change extractors designed for different change detection and captioning tasks hinder a unified framework. To tackle these challenges, Change3D treats the bi-temporal images as two frames of a tiny video. By inserting learnable perception frames between the bi-temporal images, a video encoder lets the perception frames interact directly with the images and perceive their differences. This allows us to dispense with intricate change extractors and provides a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework achieves superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs of state-of-the-art methods. We hope that Change3D can serve as an alternative to 2D-based models and facilitate future research.
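To make the perception-frame idea concrete, the following is a minimal NumPy sketch (not the paper's implementation; all names and the toy temporal-mixing layer are hypothetical stand-ins for a real video encoder such as a 3D CNN): the two bi-temporal images and a learnable perception frame are stacked along a time axis into a tiny "video", one temporal mixing step lets the perception frame attend to both images, and its features are then read out as the change representation for a downstream head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bi-temporal inputs: C channels over an H x W spatial grid.
C, H, W = 4, 8, 8
img_t1 = rng.standard_normal((C, H, W))
img_t2 = rng.standard_normal((C, H, W))

# Learnable perception frame (hypothetical name), initialized like a small parameter.
perception = rng.standard_normal((C, H, W)) * 0.02

# Assemble a tiny "video": [t1, perception, t2] along a new time axis.
video = np.stack([img_t1, perception, img_t2], axis=1)  # shape (C, T=3, H, W)

# Toy temporal mixing standing in for one layer of a video encoder:
# each output frame is a weighted sum of all three input frames, so the
# perception frame interacts with both images directly.
W_t = rng.standard_normal((3, 3)) * 0.1            # (T_out, T_in) mixing weights
mixed = np.einsum('ot,cthw->cohw', W_t, video)     # (C, T=3, H, W)

# The middle (perception) frame now carries change-aware features,
# ready for a task head (e.g., a binary change-detection decoder).
change_features = mixed[:, 1]                      # (C, H, W)
print(video.shape, change_features.shape)
```

In an actual video encoder the mixing weights would be learned 3D-convolution kernels applied jointly over space and time, but the readout pattern is the same: after encoding, only the perception-frame slice is passed to the task-specific head, which is what removes the need for a separate change extractor.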