Fine-grained classification is a challenging task that involves identifying subtle differences between objects within the same category. It is especially difficult when data is scarce. Vision Transformers (ViTs) have recently emerged as a powerful tool for image classification, owing to their ability to learn highly expressive representations of visual data through self-attention. In this work, we explore Semi-ViT, a ViT model fine-tuned with semi-supervised learning techniques and suited to settings where annotated data is lacking. Such settings are particularly common in e-commerce, where images are readily available but labels are noisy, nonexistent, or expensive to obtain. Our results demonstrate that Semi-ViT outperforms traditional convolutional neural networks (CNNs) and ViTs, even when fine-tuned with limited annotated data. These findings indicate that Semi-ViTs hold significant promise for applications that require precise, fine-grained classification of visual data.