The growing popularity of Vision Transformers as the go-to models for image classification has led to an explosion of architectural modifications claiming to be more efficient than the original ViT. However, widely differing experimental conditions prevent a fair comparison among them based solely on their reported results. To address this gap in comparability, we conduct a comprehensive analysis of more than 30 models to evaluate the efficiency of vision transformers and related architectures across several performance metrics. Our benchmark provides a comparable baseline across the landscape of efficiency-oriented transformers and yields several surprising insights. For example, we discover that ViT is still Pareto optimal across multiple efficiency metrics, despite the existence of several alternative approaches claiming to be more efficient. Results also indicate that hybrid attention-CNN models fare particularly well in terms of inference memory and parameter count, and that scaling the model size is preferable to scaling the image size. Furthermore, we uncover a strong positive correlation between the number of FLOPs and training memory, which enables the estimation of required VRAM from theoretical measurements alone. Thanks to this holistic evaluation, the study offers practitioners and researchers guidance for making informed decisions when selecting models for specific applications. We publicly release our code and data at https://github.com/tobna/WhatTransformerToFavor
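To make the notion of Pareto optimality used above concrete, the following is a minimal sketch of how a Pareto front over accuracy and a single cost metric could be computed. The `Model` type, the `pareto_front` helper, the model names, and all numbers are hypothetical and chosen only for illustration; they are not taken from the benchmark or its released code.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better (e.g. GFLOPs, memory, or latency)


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models that no other model dominates.

    A model is dominated if another model is at least as accurate and at
    least as cheap, and strictly better in one of the two.
    """
    front = []
    for m in models:
        dominated = any(
            o.accuracy >= m.accuracy
            and o.cost <= m.cost
            and (o.accuracy > m.accuracy or o.cost < m.cost)
            for o in models
        )
        if not dominated:
            front.append(m)
    # Sort by cost so the front reads from cheapest to most expensive.
    return sorted(front, key=lambda m: m.cost)


# Made-up numbers for illustration only; not results from the benchmark.
candidates = [
    Model("model_a", accuracy=80.0, cost=4.0),
    Model("model_b", accuracy=78.0, cost=6.0),  # dominated by model_a
    Model("model_c", accuracy=82.0, cost=9.0),
    Model("model_d", accuracy=75.0, cost=2.0),
]

front_names = [m.name for m in pareto_front(candidates)]
print(front_names)
```

In this toy example, `model_b` is excluded because `model_a` is both more accurate and cheaper; the remaining three models each represent a different accuracy-cost trade-off. A model being Pareto optimal "across multiple efficiency metrics" would correspond to it appearing on such a front for each cost metric considered.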