The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, therefore requires multiple models of different sizes. However, this approach has limitations, such as the need to train and store each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly varying the embedding dimension in each layer, and the corresponding number of attention heads in MHA, during training, HydraViT induces multiple subnetworks within a single model. As a result, HydraViT adapts to a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. On ImageNet-1K, HydraViT achieves up to 5 p.p. higher accuracy at the same GMACs and up to 7 p.p. higher accuracy at the same throughput compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code is available at https://github.com/ds-kiel/HydraViT.
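The core idea, one shared set of MHA weights from which smaller subnetworks are obtained by keeping only the first few attention heads, can be illustrated with a minimal sketch. This is not the paper's implementation; `SlicableMHA`, its parameters, and the plain-numpy attention below are hypothetical names and simplifications (no batching, no masking, no training loop).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SlicableMHA:
    """Hypothetical sketch: one weight set shared by all subnetworks.
    A subnetwork with k heads uses only the leading k * head_dim
    rows/columns of each projection matrix, so the smallest model's
    weights are a prefix of every larger model's weights."""

    def __init__(self, max_heads=12, head_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        d = max_heads * head_dim  # full embedding dimension
        self.head_dim = head_dim
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wo = rng.standard_normal((d, d)) / np.sqrt(d)

    def forward(self, x, num_heads):
        """x: (tokens, num_heads * head_dim) — the sliced embedding."""
        d_sub = num_heads * self.head_dim
        assert x.shape[-1] == d_sub
        # Slice the shared weights down to the active subnetwork.
        q = x @ self.Wq[:d_sub, :d_sub]
        k = x @ self.Wk[:d_sub, :d_sub]
        v = x @ self.Wv[:d_sub, :d_sub]
        t = x.shape[0]

        def split(a):  # (tokens, d_sub) -> (heads, tokens, head_dim)
            return a.reshape(t, num_heads, self.head_dim).transpose(1, 0, 2)

        q, k, v = split(q), split(k), split(v)
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.head_dim))
        out = (attn @ v).transpose(1, 0, 2).reshape(t, d_sub)
        return out @ self.Wo[:d_sub, :d_sub]
```

During training one would sample `num_heads` per batch so that every prefix slice is trained; at deployment, a single fixed `num_heads` is chosen to match the target hardware, with no separate model files to store.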