Vision backbone networks play a central role in modern computer vision, and enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (multiply-accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of this metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights for optimizing backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that further improves upon its baseline's speed on both edge and desktop GPUs. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets and by adapting it to several downstream tasks: object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
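To make the MAC metric concrete, the following minimal sketch counts MACs for a 2D convolution using the standard formula (output positions × output channels × input channels per group × kernel area, bias ignored). The helper name `conv2d_macs` and the layer sizes are illustrative assumptions, not taken from the paper, and the choice of a standard versus a depthwise-separable convolution is only an assumed example of an architectural design element.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel_size, groups=1):
    """Count multiply-accumulate operations of a 2D convolution (bias ignored).

    Hypothetical helper for illustration; it is not part of the LowFormer code.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must both be divisible by groups")
    # Each output element accumulates over (c_in / groups) * k * k products.
    return h_out * w_out * c_out * (c_in // groups) * kernel_size ** 2


# Assumed example layer: 64 -> 64 channels on a 56x56 output feature map.
standard = conv2d_macs(56, 56, 64, 64, kernel_size=3)

# Depthwise-separable counterpart: 3x3 depthwise conv followed by 1x1 pointwise conv.
depthwise = conv2d_macs(56, 56, 64, 64, kernel_size=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel_size=1)
separable = depthwise + pointwise

print(f"standard conv MACs:  {standard:,}")
print(f"separable conv MACs: {separable:,}")
print(f"MAC ratio:           {standard / separable:.2f}x")
```

The count depends only on layer shapes, so it is identical on every device. Whether a reduction in MACs translates into a comparable reduction in execution time has to be measured on the target hardware, which is the kind of comparison the paper carries out.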