Advances in computer vision research have established the transformer architecture as the state of the art in computer vision tasks. One known drawback of the transformer architecture is its high number of parameters, which can lead to a more complex and less efficient algorithm. This paper aims to reduce the number of parameters and, in turn, make the transformer more efficient. We present the Sparse Transformer (SparTa) Block, a modified transformer block with an added sparse token converter that reduces the number of tokens used. We use the SparTa Block inside the Swin T architecture (SparseSwin) to leverage Swin's ability to downsample its input and reduce the number of initial tokens to be calculated. The proposed SparseSwin model outperforms other state-of-the-art models in image classification, with accuracies of 86.96%, 97.43%, and 85.35% on the ImageNet100, CIFAR10, and CIFAR100 datasets, respectively. Achieved with fewer parameters, this result highlights the potential of a transformer architecture that uses a sparse token converter with a limited number of tokens to optimize the use of the transformer and improve its performance.
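To illustrate why downsampling the input and limiting the token count can pay off, here is a rough back-of-the-envelope sketch. It is not the authors' implementation. It assumes a 224x224 input, cumulative downsampling strides of 4, 8, 16, and 32, and full (non-windowed) self-attention, which is a simplification. The value `SPARSE_TOKENS = 49` and the helper names `token_count` and `attention_pairs` are hypothetical and chosen only for illustration.

```python
def token_count(image_size: int, total_stride: int) -> int:
    """Tokens on a square feature map after downsampling by total_stride."""
    side = image_size // total_stride
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens ** 2


IMAGE_SIZE = 224          # assumed input resolution
STRIDES = (4, 8, 16, 32)  # assumed cumulative downsampling per stage
SPARSE_TOKENS = 49        # hypothetical output size of a token converter

# Token count and pairwise attention work at each assumed stage.
for stride in STRIDES:
    n = token_count(IMAGE_SIZE, stride)
    print(f"stride {stride:>2}: {n:>4} tokens, {attention_pairs(n):>8} pairs")

# Effect of converting a stride-16 feature map to a fixed, smaller token set.
dense_tokens = token_count(IMAGE_SIZE, 16)
saving = attention_pairs(dense_tokens) / attention_pairs(SPARSE_TOKENS)
print(f"{dense_tokens} -> {SPARSE_TOKENS} tokens: {saving:.1f}x fewer pairs")
```

Because the pairwise cost grows with the square of the token count, even a modest reduction in tokens shrinks the attention workload considerably under these assumptions.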