In this paper, we present Change3D, a framework that reconceptualizes change detection and change captioning through video modeling. Recent methods have achieved remarkable success by treating each pair of bi-temporal images as separate frames: a shared-weight image encoder extracts spatial features, and a change extractor then captures the differences between the two images. However, image feature encoding is a task-agnostic process and cannot attend effectively to changed regions. Furthermore, the diverse change extractors designed for different change detection and captioning tasks hinder a unified framework. To tackle these challenges, Change3D treats the bi-temporal images as two frames of a tiny video. By inserting learnable perception frames between the bi-temporal images, a video encoder lets the perception frames interact directly with the images and perceive their differences. This allows us to dispense with intricate change extractors and provides a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework achieves superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs of state-of-the-art methods. We hope that Change3D can serve as an alternative to 2D-based models and facilitate future research.
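To make the perception-frame idea concrete, the following is a minimal NumPy sketch (not the paper's implementation; all names and the toy temporal-mixing layer are hypothetical stand-ins for a real video encoder such as a 3D CNN): the two bi-temporal images and a learnable perception frame are stacked along a time axis into a tiny "video", one temporal mixing step lets the perception frame attend to both images, and its features are then read out as the change representation for a downstream head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bi-temporal inputs: C channels over an H x W spatial grid.
C, H, W = 4, 8, 8
img_t1 = rng.standard_normal((C, H, W))
img_t2 = rng.standard_normal((C, H, W))

# Learnable perception frame (hypothetical name), initialized like a small parameter.
perception = rng.standard_normal((C, H, W)) * 0.02

# Assemble a tiny "video": [t1, perception, t2] along a new time axis.
video = np.stack([img_t1, perception, img_t2], axis=1)  # shape (C, T=3, H, W)

# Toy temporal mixing standing in for one layer of a video encoder:
# each output frame is a weighted sum of all three input frames, so the
# perception frame interacts with both images directly.
W_t = rng.standard_normal((3, 3)) * 0.1            # (T_out, T_in) mixing weights
mixed = np.einsum('ot,cthw->cohw', W_t, video)     # (C, T=3, H, W)

# The middle (perception) frame now carries change-aware features,
# ready for a task head (e.g., a binary change-detection decoder).
change_features = mixed[:, 1]                      # (C, H, W)
print(video.shape, change_features.shape)
```

In an actual video encoder the mixing weights would be learned 3D-convolution kernels applied jointly over space and time, but the readout pattern is the same: after encoding, only the perception-frame slice is passed to the task-specific head, which is what removes the need for a separate change extractor.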