ASR: Attention-alike Structural Re-parameterization

Structural re-parameterization (SRP) is a novel deep learning technique that achieves interconversion between different network architectures through equivalent parameter transformations. Applied at inference time, these transformations mitigate the extra costs, such as parameter size and inference time, that are incurred to improve performance during training, and therefore SRP has great potential for industrial and practical applications. Existing SRP methods already cover many commonly used architectures, such as normalizations, pooling methods, and multi-branch convolution. However, the widely used attention modules, which drastically slow inference, cannot be directly implemented by SRP because these modules usually act on the backbone network in a multiplicative manner and their output is input-dependent during inference, which limits the application scenarios of SRP. In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon, the Stripe Observation, which reveals that channel attention values quickly approach some constant vectors during training. This observation inspires us to propose a simple yet effective attention-alike structural re-parameterization (ASR) that allows us to achieve SRP for a given network while enjoying the effectiveness of the attention mechanism. Extensive experiments conducted on several standard benchmarks demonstrate the effectiveness of ASR in generally improving the performance of existing backbone networks, attention modules, and SRP methods without any elaborate model crafting. We also analyze the limitations and provide experimental and theoretical evidence for the strong robustness of the proposed ASR.
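
To make the central idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the channel attention output has already become a constant vector `v` (as the Stripe Observation suggests) and that the attention acts directly on the output of a batch normalization layer; under those assumptions the multiplicative attention can be folded into the normalization's affine parameters. The names `batchnorm_inference` and `fold_constant_attention` are hypothetical and chosen only for illustration.

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm on x of shape (N, C, H, W).

    gamma, beta, mean, var are per-channel vectors of shape (C,).
    """
    scale = gamma / np.sqrt(var + eps)
    shift = beta - mean * scale
    return x * scale[None, :, None, None] + shift[None, :, None, None]

def fold_constant_attention(gamma, beta, v):
    """Fold a constant channel-attention vector v into the BN affine terms.

    Uses v * (gamma * x_hat + beta) = (v * gamma) * x_hat + (v * beta),
    which holds only because v does not depend on the input.
    """
    return gamma * v, beta * v

# Illustrative random data (assumed shapes, not from the paper).
rng = np.random.default_rng(0)
N, C, H, W = 2, 4, 3, 3
x = rng.normal(size=(N, C, H, W))
gamma = rng.normal(size=C)
beta = rng.normal(size=C)
mean = rng.normal(size=C)
var = rng.uniform(0.5, 2.0, size=C)

# A constant attention vector in (0, 1), e.g. the output of a sigmoid.
v = 1.0 / (1.0 + np.exp(-rng.normal(size=C)))

# Unfolded form: BN followed by channel-wise multiplication by v.
y_unfolded = batchnorm_inference(x, gamma, beta, mean, var) * v[None, :, None, None]

# Folded form: a single BN whose affine parameters absorb v.
gamma_f, beta_f = fold_constant_attention(gamma, beta, v)
y_folded = batchnorm_inference(x, gamma_f, beta_f, mean, var)

max_abs_diff = float(np.max(np.abs(y_unfolded - y_folded)))
```

The two outputs agree up to floating-point error, so the multiplicative branch adds no parameters or computation at inference. An input-dependent attention vector would not admit this folding, which is the obstacle the abstract describes.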
