This paper addresses the few-shot image classification problem, in which unlabeled query samples must be classified given only a small number of labeled support samples. One major challenge of few-shot learning is the large variety in an object's visual appearance, which prevents the support samples from representing that object comprehensively. This can produce a significant difference between support and query samples, undermining the performance of few-shot algorithms. We tackle the problem by proposing the Few-shot Cosine Transformer (FS-CT), which obtains the relational map between supports and queries for few-shot tasks. FS-CT consists of two parts: a learnable prototypical embedding network that obtains categorical representations from support samples that include hard cases, and a transformer encoder that obtains the relational map between the support and query samples. We introduce Cosine Attention, a more robust and stable attention module that enhances the transformer module significantly and improves FS-CT accuracy by 5% to over 20% compared to the default scaled dot-product mechanism. Our method achieves competitive results on mini-ImageNet, CUB-200, and CIFAR-FS for 1-shot and 5-shot learning tasks across backbones and few-shot configurations. We also developed a custom few-shot dataset for Yoga pose recognition to demonstrate the potential of our algorithm for practical application. FS-CT with cosine attention is a lightweight, simple few-shot algorithm that can be applied to a wide range of applications, such as healthcare, medicine, and security surveillance. The official implementation of our Few-shot Cosine Transformer is available at https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer
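The abstract contrasts Cosine Attention with the default scaled dot-product mechanism but does not state either formula. As a rough illustration only, the sketch below assumes that cosine attention weights each value by the cosine similarity between a query and a key, with no softmax, while the baseline uses softmax over dot products scaled by the square root of the feature dimension. The function names, the `eps` constant, and the toy vectors are hypothetical and are not taken from the official repository.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, values):
    """Combine value vectors using one weight per value."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cosine_attention(queries, keys, values, eps=1e-8):
    """Assumed form: weight_ij = cos(q_i, k_j), applied without softmax."""
    outputs = []
    for q in queries:
        q_norm = math.sqrt(dot(q, q))
        weights = [
            dot(q, k) / (q_norm * math.sqrt(dot(k, k)) + eps) for k in keys
        ]
        outputs.append(weighted_sum(weights, values))
    return outputs


def scaled_dot_product_attention(queries, keys, values):
    """Baseline: softmax(q . k / sqrt(d)) over the keys."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append(weighted_sum([e / total for e in exps], values))
    return outputs


# Toy data: one query embedding, two class prototypes used as keys.
query = [[1.0, 0.0]]
prototypes = [[2.0, 0.0], [0.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

cos_out = cosine_attention(query, prototypes, values)
sdp_out = scaled_dot_product_attention(query, prototypes, values)
```

Under this assumption the cosine weights stay within [-1, 1] and do not change when a query or key is rescaled, whereas the softmax weights in the baseline depend on the magnitude of the embeddings. That is one plausible reading of the "more robust and stable" claim, not a statement of the authors' reasoning.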