In an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, improvement strategies for attention mechanisms typically necessitate extensive retraining, which is impractical for large language models with vast numbers of parameters. In this paper, we present DiJiang, a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear-complexity model at little training cost. By employing a weighted Quasi-Monte Carlo method for sampling, the proposed approach theoretically offers superior approximation efficiency. To further reduce the training computational complexity, our kernelization is based on Discrete Cosine Transform (DCT) operations. Extensive experiments demonstrate that the proposed method achieves performance comparable to the original Transformer, but with significantly reduced training costs and much faster inference speeds. Our DiJiang-7B achieves performance comparable to LLaMA2-7B on various benchmarks while requiring only about 1/50 of the training cost. Code is available at https://github.com/YuchuanTian/DiJiang.
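The core idea, kernelized linear attention with a frequency-domain feature map, can be illustrated with a short sketch. This is illustrative only: it uses a plain orthonormal DCT-II feature map rather than the paper's weighted Quasi-Monte Carlo sampling, and the helper names (`dct_feature_map`, `linear_attention`) are hypothetical, not from the released code.

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II basis matrix of size (d, d)."""
    i = np.arange(d)
    k = i[:, None]
    C = np.sqrt(2.0 / d) * np.cos(np.pi * (2 * i + 1) * k / (2 * d))
    C[0] /= np.sqrt(2.0)  # scale the k=0 row for orthonormality
    return C

def dct_feature_map(x):
    """Map queries/keys through a DCT, then exp() to keep features positive
    (a common choice in kernelized attention)."""
    C = dct_matrix(x.shape[-1])
    return np.exp(x @ C.T)

def linear_attention(Q, K, V):
    """Linear-complexity attention: associate phi(K)^T V first,
    giving O(n * d^2) cost instead of the O(n^2 * d) softmax form."""
    phi_q, phi_k = dct_feature_map(Q), dct_feature_map(K)
    kv = phi_k.T @ V                        # (d, d_v), independent of n
    normalizer = phi_q @ phi_k.sum(axis=0)  # (n,), row-wise normalization
    return (phi_q @ kv) / normalizer[:, None]
```

Because the feature map is positive, the implicit attention weights form a proper weighted average over `V`, and reordering the matrix products is what removes the quadratic dependence on sequence length.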