Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by introducing a simple yet effective DCT-based initialization strategy for self-attention, where projection weights are initialized using DCT coefficients. This structure-preserving approach consistently improves classification accuracy on the CIFAR-10 and ImageNet-1K benchmarks. Second, we propose a DCT-based attention compression technique that exploits the decorrelation properties of the frequency domain. By observing that high-frequency DCT coefficients typically correspond to noise, we truncate high-frequency components of the input patches, thereby reducing the dimensionality of the query, key, and value projections without sacrificing accuracy. Experiments on Swin Transformer models demonstrate that the proposed compression method achieves a substantial reduction in computational overhead while maintaining comparable performance.
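The two ideas above can be illustrated with a minimal sketch. It assumes an orthonormal DCT-II basis and a one-dimensional feature vector; the abstract does not specify the exact transform variant, scaling, or how the coefficients are assigned to the query, key, and value projections, so the function names `dct_matrix`, `dct_init_weights`, and `truncate_high_freq` are hypothetical and this is not the authors' implementation.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows.

    Row k holds the k-th cosine basis vector; row 0 is the constant (DC) term.
    """
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ])
    return rows


def dct_init_weights(dim):
    """Assumed initializer: start a dim x dim projection from the DCT basis
    instead of random values. The paper's exact scheme may differ."""
    return dct_matrix(dim)


def truncate_high_freq(x, keep):
    """Transform x with the DCT and keep only the `keep` lowest-frequency
    coefficients, shrinking the vector from len(x) to `keep` entries."""
    basis = dct_matrix(len(x))[:keep]
    return [sum(b * v for b, v in zip(row, x)) for row in basis]


# Example: an 8-dimensional feature vector reduced to 4 coefficients.
patch = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
compressed = truncate_high_freq(patch, keep=4)
print(len(compressed))  # 4
```

Because the basis is orthonormal, dropping the trailing rows discards only the energy carried by the high-frequency coefficients, which is why the downstream projections can operate on fewer dimensions in this sketch.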