Rethinking the Encoding of Satellite Image Time Series

Representation learning for Satellite Image Time Series (SITS) presents unique challenges, such as a prohibitive computational burden caused by high spatiotemporal resolution, irregular acquisition times, and complex spatiotemporal interactions, which have led to highly specialized neural network architectures for SITS analysis. Despite the promising results achieved by some pioneering work, we argue that a satisfactory representation learning paradigm has not yet been established for SITS analysis, leaving the field an isolated island to which successful paradigms and the latest advances from Computer Vision (CV) are arduous to transfer. In this paper, we recast SITS processing as a direct set prediction problem, inspired by the recent trend of adopting query-based transformer decoders to streamline object detection and image segmentation pipelines. We further propose to decompose the representation learning process of SITS into three explicit steps, collect-update-distribute, a decomposition that is computationally efficient and well suited to irregularly sampled and asynchronous temporal observations. Thanks to this reformulation and the proposed feature extraction framework, our models, pre-trained on pixel-set format input and then fine-tuned on downstream dense prediction tasks by simply appending a commonly used segmentation network, attain new state-of-the-art (SoTA) results on the PASTIS dataset compared with bespoke neural architectures such as U-TAE. Furthermore, the clear separation, both conceptual and practical, between the temporal and spatial components of the SITS panoptic segmentation pipeline allows us to leverage recent advances in CV such as Mask2Former, a universal segmentation architecture, resulting in an increase of 8.8 points in PQ over the best score reported so far.
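
The abstract names the three steps but does not spell out how each is computed, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes a small set of learned queries that first gather information from the irregularly dated observations by cross-attention (collect), then exchange information among themselves by self-attention (update), and finally pass the result back to the observation tokens by cross-attention (distribute). The function names, the sinusoidal date encoding, the residual connections, and the single-head attention are all assumptions made for the example. With K queries and T observations, the attention cost in this sketch scales with T·K rather than T², which is one way such a scheme could stay cheap on long series.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=axis, keepdims=True)


def attend(queries, keys, values, key_mask=None):
    """Single-head scaled dot-product attention.

    key_mask is a boolean vector, True where a key is a real observation.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    if key_mask is not None:
        # Masked keys receive a very negative score, hence zero weight.
        scores = np.where(key_mask[None, :], scores, -1e9)
    return softmax(scores) @ values


def time_encoding(days, dim):
    """Sinusoidal encoding of acquisition dates (hypothetical choice; dim must be even)."""
    freqs = 1.0 / (1000.0 ** (np.arange(0, dim, 2) / dim))
    angles = days[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def collect_update_distribute(obs, days, valid, queries):
    """Assumed three-step encoding of one pixel's time series.

    obs:     (T, D) per-date features
    days:    (T,)   acquisition day of each observation (irregular spacing allowed)
    valid:   (T,)   True for real observations, False for padding
    queries: (K, D) query vectors (learned in a real model, random here)
    """
    tokens = obs + time_encoding(days, obs.shape[-1])
    # Collect: queries read from the valid observations only.
    collected = queries + attend(queries, tokens, tokens, key_mask=valid)
    # Update: queries exchange information with each other.
    updated = collected + attend(collected, collected, collected)
    # Distribute: each observation token reads back from the updated queries.
    distributed = tokens + attend(tokens, updated, updated)
    return updated, distributed


rng = np.random.default_rng(0)
T, D, K = 6, 8, 3
obs = rng.normal(size=(T, D))
days = np.array([3.0, 18.0, 40.0, 41.0, 95.0, 0.0])  # last entry is padding
valid = np.array([True, True, True, True, True, False])
queries = rng.normal(size=(K, D))

updated, distributed = collect_update_distribute(obs, days, valid, queries)
print(updated.shape, distributed.shape)  # (3, 8) (6, 8)
```

Because the collect step masks padded entries, series with different numbers of acquisitions can be batched together without the padding influencing the query representations.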
