In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and we present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms: acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
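To make the tokenization step concrete, the following is a minimal NumPy sketch of how a SITS record might be split into non-overlapping space-time patches and then arranged for a temporal-then-spatial factorization. It is an illustration under stated assumptions, not the released TSViT code: the input layout `(T, H, W, C)`, the function names `tokenize_sits` and `temporal_then_spatial_views`, and the patch sizes are all hypothetical, and the attention encoders, positional encodings, and class tokens are omitted.

```python
import numpy as np


def tokenize_sits(x, t_patch, s_patch):
    """Split a SITS array of shape (T, H, W, C) into non-overlapping patches.

    Assumed layout (not taken from the paper): time, height, width, channels.
    Returns an array of shape (nt, nh, nw, t_patch * s_patch * s_patch * C),
    i.e. one flattened token per space-time patch.
    """
    T, H, W, C = x.shape
    if T % t_patch or H % s_patch or W % s_patch:
        raise ValueError("T, H and W must be divisible by the patch sizes")
    nt, nh, nw = T // t_patch, H // s_patch, W // s_patch
    # Expose the patch grid and the within-patch offsets as separate axes.
    x = x.reshape(nt, t_patch, nh, s_patch, nw, s_patch, C)
    # Bring the grid axes to the front: (nt, nh, nw, t_patch, s_patch, s_patch, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten each patch into a single token vector.
    return x.reshape(nt, nh, nw, -1)


def temporal_then_spatial_views(tokens):
    """Arrange tokens as sequences for the two stages of a factorized encoder.

    temporal: (nh * nw, nt, d), one time sequence per spatial location.
    spatial:  (nt, nh * nw, d), one spatial sequence per time step.
    Only the axis ordering is shown; a real model would feed the output of
    the temporal stage into the spatial stage.
    """
    nt, nh, nw, d = tokens.shape
    temporal = tokens.transpose(1, 2, 0, 3).reshape(nh * nw, nt, d)
    spatial = tokens.reshape(nt, nh * nw, d)
    return temporal, spatial


if __name__ == "__main__":
    # Hypothetical record: 4 acquisitions of an 8x8 image with 3 bands.
    record = np.arange(4 * 8 * 8 * 3, dtype=np.float64).reshape(4, 8, 8, 3)
    tokens = tokenize_sits(record, t_patch=2, s_patch=4)
    temporal, spatial = temporal_then_spatial_views(tokens)
    print(tokens.shape, temporal.shape, spatial.shape)
```

In this sketch the temporal view groups all tokens that share a spatial location, so attention applied along its second axis would mix information across acquisition times before any mixing across space, which is the ordering the abstract argues for.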