Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with `time' serving as a supervisory signal, since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a transformer-based model for ordering image sequences of arbitrary length, with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple domains covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves state-of-the-art performance on standard image-ordering benchmarks.
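To make the proxy task concrete, the following is a minimal sketch of how a self-supervised training pair for sequence ordering can be constructed: the time-ordered frames are shuffled, and the permutation itself (i.e., each frame's original temporal index) becomes the supervisory label, so no manual annotation is needed. The function names and the absence of any actual transformer model are our illustrative assumptions, not the paper's implementation.

```python
import random

def make_ordering_example(frames, rng=random.Random(0)):
    """Build one training pair for the ordering proxy task.

    Shuffles the time-ordered frames and records, for each shuffled
    slot, the original temporal index it came from; `time' itself
    supplies the labels. (Hypothetical helper, for illustration.)
    """
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]
    # order[j] = original temporal position of shuffled frame j;
    # a model would be trained to predict this from pixels alone.
    return shuffled, order

def recover_sequence(shuffled, predicted_order):
    """Re-sort frames by their predicted temporal indices (inference)."""
    return [f for _, f in sorted(zip(predicted_order, shuffled))]
```

Note that this construction only yields a learnable signal when some change in the frames is monotonic with time; purely cyclic or stochastic variation gives the model no basis for recovering the true order, which is what drives the discovery behavior described above.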