We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene taken at different times. Accurately identifying verifiable changes is extremely challenging: some objects may appear to be missing because they are occluded or out of frame, while others may look different because of large viewpoint changes. To study this problem, we introduce the SceneDiff Benchmark, the first multiview change detection dataset for scenes captured along different camera trajectories, comprising 350 diverse video pairs with dense object instance-level annotations. We also introduce the SceneDiff algorithm, a training-free approach that solves for image poses, segments images into objects, and compares the objects using semantic and geometric features. By building on pretrained models, SceneDiff generalizes across domains without retraining and naturally improves as the underlying models advance. Experiments on multiview and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (53.0% and 30.6% relative AP improvements, respectively). Project page: https://yuqunw.github.io/SceneDiff
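The comparison step described above (matching objects across two captures using semantic and geometric cues) can be illustrated with a toy sketch. This is not the SceneDiff implementation: the `SceneObject` type, the `diff_scenes` function, the greedy matching strategy, and both thresholds are hypothetical choices made for illustration. The sketch assumes each object already has a semantic embedding and a 3D centroid expressed in a shared coordinate frame, and it ignores occlusion and out-of-frame cases, which the abstract identifies as the hard part of the problem.

```python
from dataclasses import dataclass
from math import dist, sqrt
from typing import Dict, List, Sequence


@dataclass
class SceneObject:
    """Hypothetical per-object record; not a SceneDiff data structure."""
    name: str
    feature: Sequence[float]   # semantic embedding (assumed given)
    centroid: Sequence[float]  # 3D position in a shared frame (assumed given)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diff_scenes(
    before: List[SceneObject],
    after: List[SceneObject],
    sim_thresh: float = 0.8,   # assumed semantic-match threshold
    move_thresh: float = 0.2,  # assumed displacement threshold (scene units)
) -> Dict[str, List[str]]:
    """Greedily match objects by semantic similarity, then test displacement."""
    changes: Dict[str, List[str]] = {"added": [], "removed": [], "moved": []}
    unmatched = list(after)
    for obj in before:
        # Pick the most semantically similar object not yet matched.
        best = max(unmatched, key=lambda o: cosine(obj.feature, o.feature),
                   default=None)
        if best is None or cosine(obj.feature, best.feature) < sim_thresh:
            changes["removed"].append(obj.name)
            continue
        unmatched.remove(best)
        # Same object in both captures: flag it if it moved far enough.
        if dist(obj.centroid, best.centroid) > move_thresh:
            changes["moved"].append(obj.name)
    # Anything left in the second capture has no counterpart in the first.
    changes["added"] = [o.name for o in unmatched]
    return changes


before = [
    SceneObject("mug", (1, 0, 0, 0), (0.0, 0.0, 0.0)),
    SceneObject("book", (0, 1, 0, 0), (1.0, 0.0, 0.0)),
    SceneObject("lamp", (0, 0, 1, 0), (2.0, 0.0, 0.0)),
]
after = [
    SceneObject("mug", (1, 0, 0, 0), (0.5, 0.0, 0.0)),
    SceneObject("book", (0, 1, 0, 0), (1.0, 0.0, 0.0)),
    SceneObject("plant", (0, 0, 0, 1), (3.0, 0.0, 0.0)),
]
changes = diff_scenes(before, after)
print(changes)
```

Greedy matching keeps the sketch short; an optimal assignment and a visibility check (to avoid reporting occluded or out-of-frame objects as removed) would be natural refinements.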