Transformers come with a high computational cost, yet their effectiveness in addressing problems in language and vision has sparked extensive research aimed at enhancing their efficiency. However, diverse experimental conditions, spanning multiple input domains, prevent a fair comparison based solely on reported results, posing challenges for model selection. To address this gap in comparability, we design a comprehensive benchmark of more than 30 models for image classification, evaluating key efficiency aspects, including accuracy, speed, and memory usage. This benchmark provides a standardized baseline across the landscape of efficiency-oriented transformers, and our framework of analysis, based on Pareto optimality, reveals surprising insights. Despite claims of other models being more efficient, ViT remains Pareto optimal across multiple metrics. We observe that hybrid attention-CNN models exhibit remarkable inference memory and parameter efficiency. Moreover, our benchmark shows that using a larger model is generally more efficient than using higher-resolution images. Thanks to our holistic evaluation, we provide a centralized resource for practitioners and researchers, facilitating informed decisions when selecting transformers or measuring progress in the development of efficient transformers.
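To make the notion of Pareto optimality used in the abstract concrete, the following is a minimal sketch of how a Pareto front over accuracy, speed, and memory could be computed. It is not the authors' evaluation code: the `Model` record, the choice of throughput as the speed metric, and all numbers are hypothetical and serve only as an illustration. A model is Pareto optimal if no other model is at least as good on every metric and strictly better on at least one.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """Hypothetical benchmark record; field names are assumptions."""
    name: str
    accuracy: float    # top-1 accuracy, higher is better
    throughput: float  # images per second, higher is better
    memory_gb: float   # peak memory in GB, lower is better


def dominates(a: Model, b: Model) -> bool:
    """Return True if `a` is no worse than `b` on all metrics and better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.throughput >= b.throughput
        and a.memory_gb <= b.memory_gb
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.throughput > b.throughput
        or a.memory_gb < b.memory_gb
    )
    return no_worse and strictly_better


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep every model that is not dominated by any other model."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Made-up numbers for illustration only; they are not results from the benchmark.
models = [
    Model("model_a", 0.80, 1000.0, 4.0),
    Model("model_b", 0.82, 600.0, 6.0),
    Model("model_c", 0.78, 900.0, 5.0),   # dominated by model_a on all three metrics
    Model("model_d", 0.75, 1500.0, 3.0),
]

front = pareto_front(models)
print([m.name for m in front])
```

With these illustrative values, `model_c` is excluded because `model_a` is more accurate, faster, and uses less memory, while the other three each win on at least one metric. The pairwise check is quadratic in the number of models, which is unproblematic at the scale of a few dozen models.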