Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, yielding a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by RFA, we propose Macformer, a Transformer architecture that employs random Maclaurin features (RMF) to approximate various dot-product kernels, thereby accelerating attention computation for long sequences. Macformer consists of Random Maclaurin Feature Attention (RMFA) and pre-post Scaling Batch Normalization (ppSBN): the former is an unbiased approximation of dot-product kernelized attention, and the latter is a two-stage regularization mechanism that bounds the approximation error of RMFA. We conducted toy experiments to demonstrate the efficiency of RMFA and ppSBN, and experiments on the Long Range Arena (LRA) benchmark to validate the acceleration and accuracy of Macformer with different dot-product kernels. The experimental results of Macformer are consistent with our theoretical analysis.
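To make the core mechanism concrete, below is a minimal NumPy sketch of the random Maclaurin feature construction of Kar and Karnick (2012), the technique RMFA builds on: a feature map Z such that E[Z(x)·Z(y)] equals a dot-product kernel f(⟨x, y⟩) given its Maclaurin coefficients. The function name `rmf_features`, the coefficient callback, and the exponential-kernel example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from math import factorial

def rmf_features(X, coeff, D, seed=None):
    """Random Maclaurin features (Kar & Karnick, 2012) for a dot-product
    kernel K(x, y) = f(<x, y>), where f(u) = sum_n coeff(n) * u**n is the
    Maclaurin expansion with nonnegative coefficients.
    Returns Z of shape (n, D) with E[Z @ Z.T] equal to the kernel matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = np.empty((n, D))
    for i in range(D):
        # Sample the expansion order N with P(N = k) = 2**-(k + 1).
        N = rng.geometric(0.5) - 1
        a_N = coeff(N)
        if a_N == 0.0:
            Z[:, i] = 0.0
            continue
        # Product of N independent Rademacher projections;
        # the empty product for N = 0 is 1.
        W = rng.choice([-1.0, 1.0], size=(N, d))
        Z[:, i] = np.sqrt(a_N * 2.0 ** (N + 1)) * np.prod(W @ X.T, axis=0)
    return Z / np.sqrt(D)

# Toy check on the exponential kernel exp(<x, y>), whose Maclaurin
# coefficients are a_n = 1/n!; inputs are scaled so inner products stay
# small and a moderate feature count already approximates well.
X = np.random.default_rng(0).normal(size=(4, 8)) / np.sqrt(8)
Z = rmf_features(X, lambda k: 1.0 / factorial(k), D=20000, seed=1)
print(np.exp(X @ X.T))  # exact kernel matrix
print(Z @ Z.T)          # unbiased RMF estimate
```

Because the estimate is an average over D independent features, its error shrinks as O(1/sqrt(D)); replacing the query-key kernel in attention with such features is what reduces the quadratic cost to linear in sequence length.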