Sparse Spatial Transformers for Few-Shot Learning

Learning from limited data is challenging because data scarcity leads to poor generalization of the trained model. A classical globally pooled representation is likely to lose useful local information. Many recent few-shot learning methods address this challenge by using deep descriptors and learning a pixel-level metric. However, using deep descriptors as feature representations may lose the contextual information of the image. Moreover, most of these methods handle each class in the support set independently, and therefore cannot make sufficient use of discriminative information and task-specific embeddings. In this paper, we propose a novel transformer-based neural network architecture called sparse spatial transformers (SSFormers), which finds task-relevant features and suppresses task-irrelevant features. Specifically, we first divide each input image into several image patches of different sizes to obtain dense local features. These features retain contextual information while expressing local information. Then, a sparse spatial transformer layer is proposed to find the spatial correspondence between the query image and the full support set, in order to select task-relevant image patches and suppress task-irrelevant ones. Finally, we propose an image patch-matching module that calculates the distance between dense local representations and thereby determines which category in the support set the query image belongs to. Extensive experiments on popular few-shot learning benchmarks show that our method outperforms state-of-the-art methods. Our source code is available at \url{https://github.com/chenhaoxing/ssformers}.
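The select-then-match idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that patch features are already given as plain vectors, uses cosine similarity as a stand-in for the learned correspondence, and replaces the sparse spatial transformer layer with a simple top-k selection. All function names and the `keep` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def select_task_relevant(query_patches, support_set, keep):
    """Return indices of the `keep` query patches that best match the FULL support set.

    Assumption: relevance is the best cosine match against any support patch
    of any class. The paper uses a learned transformer layer instead.
    """
    all_support = [p for patches in support_set.values() for p in patches]
    relevance = [max(cosine(q, s) for s in all_support) for q in query_patches]
    ranked = sorted(range(len(query_patches)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:keep])  # the remaining patches are suppressed


def patch_matching_score(query_patches, class_patches):
    """Average, over query patches, of the best similarity to one class's patches."""
    best = [max(cosine(q, s) for s in class_patches) for q in query_patches]
    return sum(best) / len(best)


def classify(query_patches, support_set, keep):
    """Select task-relevant query patches, then pick the best-matching class."""
    kept = select_task_relevant(query_patches, support_set, keep)
    selected = [query_patches[i] for i in kept]
    scores = {
        label: patch_matching_score(selected, patches)
        for label, patches in support_set.items()
    }
    return max(scores, key=scores.get), scores


# Toy 2-way task with hand-made 2-D "patch features" (illustrative values only).
support_set = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
query_patches = [[1.0, 0.05], [0.8, 0.2], [-1.0, -1.0]]  # last one mimics background

label, scores = classify(query_patches, support_set, keep=2)
print(label, scores)
```

In this toy task the background-like patch has a negative similarity to every support patch, so it is dropped before matching and does not affect the class scores.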
