Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this bottleneck by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements use the same function in every layer, so they do not fit each layer's behaviour well, and they rely on expensive model retraining. In this work, we propose an efficient, hardware-aware framework that uses genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach entirely eliminates the need to retrain models from scratch. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared with only $70.2\%$ for homogeneous baselines, which allows our modified architecture to recover $84.25\%$ Top-1 accuracy on ImageNet-1K in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.
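To make the bottleneck and the $R^2$ figure concrete, the following is a minimal illustrative sketch, not the evolved expressions or the method of this work. It contrasts a reference layer normalization, which needs a mean and a variance over the whole feature vector, with an element-wise scalar function, and measures how much of the target's variance the scalar function captures. The names `layer_norm`, `scalar_approx`, and `r_squared`, the scaled-tanh form, the parameters `alpha` and `scale`, and the synthetic Gaussian inputs are all assumptions made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Reference LayerNorm without affine parameters.

    Both the mean and the variance are reductions over the whole vector,
    which is the global reduction the abstract refers to.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scalar_approx(x, alpha=1.0, scale=1.0):
    """Hypothetical element-wise stand-in (an assumed scaled tanh).

    Each output depends only on its own input, so no reduction is needed.
    """
    return [scale * math.tanh(alpha * v) for v in x]


def r_squared(target, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


# Synthetic activations (assumption): 32 tokens with 64 features each.
rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.5) for _ in range(64)] for _ in range(32)]

# Flatten the per-token outputs so R^2 is computed over all elements.
target = [v for tok in tokens for v in layer_norm(tok)]
pred = [v for tok in tokens for v in scalar_approx(tok, alpha=0.5, scale=1.5)]

print(f"R^2 of the scalar stand-in against LayerNorm: {r_squared(target, pred):.3f}")
```

In this toy setting a single fixed function plays the role of a homogeneous replacement. A layer-specific approach would instead choose a different function per layer, fitted to that layer's own activation statistics.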