Satellite Image Time Series (SITS) representation learning is complex due to high spatiotemporal resolutions, irregular acquisition times, and intricate spatiotemporal interactions. These challenges have led to specialized neural network architectures tailored for SITS analysis. Although earlier work has achieved promising results, transferring the latest advances or established paradigms from Computer Vision (CV) to SITS remains highly challenging because existing representation learning frameworks are suboptimal. In this paper, we develop a novel perspective of SITS processing as a direct set prediction problem, inspired by the recent trend of adopting query-based transformer decoders to streamline object detection and image segmentation pipelines. We further propose to decompose the representation learning process of SITS into three explicit steps, collect-update-distribute, which is computationally efficient and well suited to irregularly sampled and asynchronous temporal satellite observations. Facilitated by this reformulation, our proposed temporal learning backbone for SITS, first pre-trained on the resource-efficient pixel-set format and then fine-tuned on downstream dense prediction tasks, attains new state-of-the-art (SOTA) results on the PASTIS benchmark dataset. Specifically, the clear separation between temporal and spatial components in the semantic/panoptic segmentation pipeline of SITS allows us to leverage the latest advances in CV, such as the universal image segmentation architecture, resulting in increases of 2.5 points in mIoU and 8.8 points in PQ, respectively, compared to the best scores reported so far.
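The abstract names the collect-update-distribute steps without specifying them, so the following is only an illustrative reading, not the authors' implementation. It assumes a small set of query vectors that (1) collect information from the valid observations of one pixel's time series through masked cross-attention, (2) are updated among themselves, and (3) distribute the result back to every time step. All names (`attend`, `collect_update_distribute`, `w_update`), the single-head attention, and the tanh update are hypothetical choices made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, key_mask=None):
    """Single-head scaled dot-product attention.

    key_mask is a boolean vector over keys; False marks padded
    (missing) observations that must not be attended to.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if key_mask is not None:
        scores = np.where(key_mask[None, :], scores, -1e9)
    return softmax(scores, axis=-1) @ v

def collect_update_distribute(obs, valid, queries, w_update):
    # Collect: each query reads from the valid observations only.
    collected = queries + attend(queries, obs, obs, key_mask=valid)
    # Update: queries exchange information, then pass through a
    # (hypothetical) linear layer with a tanh nonlinearity.
    mixed = collected + attend(collected, collected, collected)
    updated = np.tanh(mixed @ w_update)
    # Distribute: each time step reads back from the updated queries.
    out = obs + attend(obs, updated, updated)
    return updated, out

rng = np.random.default_rng(0)
T, K, D = 7, 4, 16                      # time steps, queries, feature size
obs = rng.normal(size=(T, D))           # one pixel's (padded) time series
valid = np.array([True] * 5 + [False] * 2)  # last two steps are padding
queries = rng.normal(size=(K, D))
w_update = rng.normal(size=(D, D)) / np.sqrt(D)

updated, out = collect_update_distribute(obs, valid, queries, w_update)
print(updated.shape, out.shape)
```

In this sketch the padded time steps never influence the query states, which is one way a query-based design can accommodate series of different lengths. The two cross-attention steps also scale with the product of the number of time steps and the number of queries rather than with the square of the sequence length.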