In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and we present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms: acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
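To make the data flow described above concrete, the following is a minimal NumPy sketch of the tokenization and the temporal-then-spatial reshaping. It is an illustration under stated assumptions, not the released implementation: all function names are hypothetical, the temporal patch length is assumed to be 1, the positional table is assumed to be indexed by day of year, and the attention layers themselves are omitted so that only the tensor shapes are shown.

```python
import numpy as np


def extract_patches(x, patch):
    """Split a SITS record (T, H, W, C) into non-overlapping spatial patches.

    Assumes a temporal patch length of 1, so each acquisition is patched
    independently. Returns an array of shape (T, N, patch * patch * C),
    where N = (H // patch) * (W // patch).
    """
    T, H, W, C = x.shape
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    nh, nw = H // patch, W // patch
    # (T, nh, p, nw, p, C) -> (T, nh, nw, p, p, C) -> (T, N, p*p*C)
    x = x.reshape(T, nh, patch, nw, patch, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(T, nh * nw, patch * patch * C)


def add_temporal_positions(tokens, day_of_year, table):
    """Add an acquisition-time-specific encoding to tokens of shape (T, N, D).

    `table` is a hypothetical (366, D) lookup indexed by day of year; in a
    trained model it would be learned rather than supplied by the caller.
    """
    return tokens + table[day_of_year][:, None, :]


def temporal_view(tokens):
    """(T, N, D) -> (N, T, D): one temporal sequence per spatial location."""
    return tokens.transpose(1, 0, 2)


def prepend_class_tokens(seq, cls):
    """Prepend K class tokens (K, D) to every sequence in seq (N, T, D)."""
    cls_b = np.broadcast_to(cls, (seq.shape[0],) + cls.shape)
    return np.concatenate([cls_b, seq], axis=1)  # (N, K + T, D)


def spatial_view(class_features):
    """(N, K, D) -> (K, N, D): one spatial sequence per class token."""
    return class_features.transpose(1, 0, 2)
```

In this sketch a temporal encoder would operate on the output of `prepend_class_tokens`, one sequence per spatial location, and a spatial encoder would then operate on the output of `spatial_view` applied to the first K positions of each sequence. Processing time before space in this order is the factorization the abstract argues for; the remaining details are assumptions made for illustration.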