Transformers are widely used for tasks in natural language processing, computer vision, speech, and music. In this paper, we discuss the efficiency of transformers in terms of memory (number of parameters), computation cost (number of floating-point operations), and model performance, including accuracy, robustness, and fair and bias-free behavior. We focus mainly on the vision transformer for image classification. Our contribution is an efficient 360 framework that covers various aspects of the vision transformer to make it more efficient for industrial applications. Considering those applications, we categorize them along multiple dimensions, such as privacy, robustness, transparency, fairness, inclusiveness, continual learning, probabilistic models, approximation, computational complexity, and spectral complexity. We compare various vision transformer models on multiple datasets by performance, number of parameters, and number of floating-point operations (FLOPs).