Transformers are widely used for tasks in natural language processing, computer vision, speech, and music. In this paper, we discuss the efficiency of transformers in terms of memory (the number of parameters), computational cost (the number of floating-point operations), and model performance, which includes accuracy, robustness, and fair and bias-free behavior. We focus mainly on the vision transformer for image classification. Our contribution is an efficient 360 framework, which covers various aspects of the vision transformer that make it more efficient for industrial applications. With those applications in mind, we categorize these aspects into multiple dimensions: privacy, robustness, transparency, fairness, inclusiveness, continual learning, probabilistic models, approximation, computational complexity, and spectral complexity. We compare various vision transformer models on multiple datasets by performance, number of parameters, and number of floating-point operations (FLOPs).