Vision Transformers (ViTs) have emerged as a promising approach for visual recognition tasks, revolutionizing the field by leveraging the power of transformer-based architectures. Among the various ViT models, Swin Transformers have gained considerable attention due to their hierarchical design and their ability to capture both local and global visual features effectively. This paper evaluates the performance of the Swin ViT model when trained with the gradient accumulation optimization (GAO) technique. We investigate the impact of GAO on the model's accuracy and training time. Our experiments show that applying GAO leads to a significant decrease in the accuracy of the Swin ViT model compared to the standard Swin Transformer model. Moreover, we observe a significant increase in the training time of the Swin ViT model when GAO is applied. These findings suggest that GAO may not be suitable for the Swin ViT model, and that caution should be exercised when using GAO with other transformer-based models.
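For readers unfamiliar with gradient accumulation, the following is a minimal sketch of the general idea, not the implementation evaluated in this paper. It assumes a toy one-dimensional linear model with a mean-squared-error loss in place of a Swin Transformer, and all names (grad_mse, full_batch_step, accumulated_step) are hypothetical. Gradients from several micro-batches are summed and the weight is updated only once, after the last micro-batch.

```python
# Minimal sketch of gradient accumulation on a toy model y = w * x.
# Assumption: this is an illustrative stand-in, not the paper's training code.


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def full_batch_step(w, xs, ys, lr):
    """One gradient-descent step computed on the whole batch at once."""
    return w - lr * grad_mse(w, xs, ys)


def accumulated_step(w, xs, ys, lr, micro_batch_size):
    """One gradient-descent step whose gradient is accumulated over micro-batches."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch size must be divisible by micro_batch_size")
    num_micro = len(xs) // micro_batch_size
    accumulated = 0.0
    for i in range(num_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient so the total equals the full-batch mean.
        accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_micro
    # The weight is updated once, only after all micro-batches are processed.
    return w - lr * accumulated


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_full = full_batch_step(0.0, xs, ys, lr=0.01)
w_accum = accumulated_step(0.0, xs, ys, lr=0.01, micro_batch_size=2)
```

On this toy problem the accumulated update matches the full-batch update up to floating-point rounding. The sketch contains none of the components of a real training pipeline, so it says nothing about the accuracy or training-time results reported above.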