Multimodal Fusion Transformer for Remote Sensing Image Classification

Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs). As a result, many researchers have tried to incorporate ViTs into hyperspectral image (HSI) classification tasks. To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters. ViTs and other similar transformers use an external classification (CLS) token that is randomly initialized and often fails to generalize well, whereas other sources of multimodal data, such as light detection and ranging (LiDAR), offer the potential to improve these models by means of a CLS token. In this paper, we introduce a new multimodal fusion transformer (MFT) network, which comprises a multihead cross patch attention (mCrossPA) for HSI land-cover classification. Our mCrossPA utilizes other sources of complementary information in addition to the HSI in the transformer encoder to achieve better generalization. The concept of tokenization is used to generate CLS and HSI patch tokens, helping to learn a distinctive representation in a reduced and hierarchical feature space. Extensive experiments are carried out on widely used benchmark datasets, i.e., the University of Houston, Trento, University of Southern Mississippi Gulfpark (MUUFL), and Augsburg. We compare the results of the proposed MFT model with other state-of-the-art transformers, classical CNNs, and conventional classifier models. The superior performance achieved by the proposed model is due to the use of multihead cross patch attention. The source code will be made available publicly at https://github.com/AnkurDeria/MFT.
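
To make the idea of a CLS token derived from a complementary modality more concrete, the following is a minimal NumPy sketch of a cross patch attention step, not the released implementation. It assumes that the CLS token is obtained by a linear projection of a flattened LiDAR patch, that this token alone acts as the query, and that keys and values come from the CLS token concatenated with the HSI patch tokens. The function name `cross_patch_attention`, the weight matrices, the residual connection, and all sizes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_patch_attention(cls_token, patch_tokens, w_q, w_k, w_v, num_heads):
    """Illustrative multihead attention with the CLS token as the only query.

    cls_token:    (1, d) token built from the complementary modality (assumed).
    patch_tokens: (n, d) tokens built from the HSI patch (assumed).
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical parameters).
    Returns the updated CLS token (1, d) and the attention weights (h, 1, n + 1).
    """
    d = cls_token.shape[-1]
    assert d % num_heads == 0, "embedding size must be divisible by num_heads"
    d_h = d // num_heads

    # Keys and values see the CLS token together with all patch tokens.
    tokens = np.concatenate([cls_token, patch_tokens], axis=0)  # (n + 1, d)

    # Project, then split the embedding dimension into heads.
    q = (cls_token @ w_q).reshape(1, num_heads, d_h).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(-1, num_heads, d_h).transpose(1, 0, 2)
    v = (tokens @ w_v).reshape(-1, num_heads, d_h).transpose(1, 0, 2)

    # Scaled dot-product attention, one row of weights per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)  # (h, 1, n + 1)
    weights = softmax(scores, axis=-1)
    fused = (weights @ v).transpose(1, 0, 2).reshape(1, d)

    # Residual connection (an assumption of this sketch).
    return cls_token + fused, weights


rng = np.random.default_rng(0)
d, n, heads = 64, 16, 8

# Hypothetical inputs: a flattened 11x11 LiDAR patch and 16 HSI patch tokens.
lidar_patch = rng.normal(size=(1, 121))
w_cls = rng.normal(size=(121, d)) * 0.1
cls_token = lidar_patch @ w_cls          # CLS token from LiDAR, not random init
hsi_tokens = rng.normal(size=(n, d))

w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
new_cls, attn = cross_patch_attention(cls_token, hsi_tokens, w_q, w_k, w_v, heads)
print(new_cls.shape, attn.shape)
```

Because only the CLS token is used as a query in this sketch, the attention cost grows linearly with the number of patch tokens rather than quadratically, and the updated CLS token is what a classification head would consume.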
