Self-attention in Transformers comes at a high computational cost because of its quadratic complexity, but the effectiveness of Transformers on language and vision problems has sparked extensive research aimed at making them more efficient. However, diverse experimental conditions, spanning multiple input domains, prevent a fair comparison based solely on reported results, posing challenges for model selection. To address this gap in comparability, we perform a large-scale benchmark of more than 45 models for image classification, evaluating key efficiency aspects, including accuracy, speed, and memory usage. Our benchmark provides a standardized baseline for efficiency-oriented transformers. We analyze the results based on the Pareto front -- the boundary of optimal models. Surprisingly, despite other models' claims of greater efficiency, ViT remains Pareto optimal across multiple metrics. We observe that hybrid attention-CNN models exhibit remarkable inference-time memory and parameter efficiency. Moreover, our benchmark shows that using a larger model is generally more efficient than using higher-resolution images. Thanks to our holistic evaluation, we provide a centralized resource for practitioners and researchers, facilitating informed decisions when selecting or developing efficient transformers.
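The Pareto-front analysis mentioned above can be made concrete with a small sketch. The snippet below is illustrative only (it is not the paper's code, and the model names and metric values are hypothetical): it marks as Pareto optimal every model that no other model matches or beats on all metrics simultaneously, assuming lower-is-better values such as error rate, latency, and peak memory.

```python
import numpy as np

def pareto_front(points):
    """Return a boolean mask of Pareto-optimal rows.

    Each row is a vector of metrics where lower is better
    (e.g. error rate, latency, memory). A point is Pareto optimal
    if no other point is at least as good in every metric and
    strictly better in at least one.
    """
    points = np.asarray(points, dtype=float)
    optimal = np.ones(points.shape[0], dtype=bool)
    for i in range(points.shape[0]):
        if not optimal[i]:
            continue
        # Point j dominates i if it is <= in all metrics and < in at least one.
        dominated_by = np.all(points <= points[i], axis=1) & np.any(points < points[i], axis=1)
        if dominated_by.any():
            optimal[i] = False
    return optimal

# Hypothetical models with (top-1 error %, latency ms, peak memory GB).
models = {
    "ViT-B":   (21.0, 5.2, 1.8),
    "Model-X": (22.5, 4.1, 1.2),
    "Model-Y": (23.0, 6.0, 2.5),  # dominated by both ViT-B and Model-X
}
mask = pareto_front(list(models.values()))
print([name for name, keep in zip(models, mask) if keep])  # ['ViT-B', 'Model-X']
```

In this toy example, Model-Y is worse than ViT-B on every metric and is therefore excluded from the Pareto front, while ViT-B and Model-X each win on at least one metric and both remain optimal.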