Using images acquired by different satellite sensors has been shown to improve classification performance in crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention to process the temporal dimension of SITS and convolutions to process the spatial dimension. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures that combine convolutional and self-attention components.
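To make the difference between two of the fusion strategies named above concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names `early_fusion` and `cross_attention` are hypothetical, the two modalities are assumed to be temporally aligned token sequences, and the attention is single-head with no learned projections. Synchronized Class Token Fusion is not sketched here.

```python
import math


def early_fusion(seq_a, seq_b):
    """Early fusion (assumed form): concatenate the feature vectors of two
    modalities at each time step, so one encoder sees a single joint input."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch assumes temporally aligned sequences")
    return [a + b for a, b in zip(seq_a, seq_b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Cross attention (simplified): every token of one modality attends over
    all tokens of the other modality. Keys and values are the same vectors
    here because the sketch omits learned projection matrices."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                      for d in range(dim)])
    return fused


# Toy data: 3 time steps, 2 features per modality (values are made up).
optical = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
radar = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

early = early_fusion(optical, radar)    # 3 tokens with 4 features each
cross = cross_attention(optical, radar)  # 3 tokens with 2 features each
```

In this sketch, early fusion merges the modalities before any encoder sees them, whereas cross attention keeps one stream per modality and lets the tokens of one stream query the other.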