To enhance the computational efficiency of quantized Transformers, we replace the dot-product and Softmax-based attention with an alternative mechanism that uses only addition and ReLU activation. This sidesteps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations, while maintaining much of the core functionality of conventional dot-product attention. It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or in alternative arithmetic systems such as homomorphic encryption. Training experiments on four common benchmark tasks show test-set prediction scores comparable to those of conventional Transformers with dot-product attention. Our scaling experiments also suggest significant computational savings, both in plaintext and under encryption. In particular, we believe that the ReLU- and addition-based attention mechanism introduced in this paper may enable privacy-preserving AI applications operating under homomorphic encryption by avoiding the costly multiplication of encrypted variables.
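To make the idea concrete, the following is a minimal sketch of what an attention step built only from addition, subtraction, and ReLU could look like. It is an illustration under stated assumptions, not the formulation used in the paper: the function name `additive_relu_attention`, the Manhattan-distance score, and the `shift` parameter are all hypothetical choices made here for the example. The sketch also assumes non-negative values in `V`, since a plain ReLU would zero out negative entries.

```python
import numpy as np


def relu(x):
    # ReLU via element-wise maximum; preserves integer dtypes.
    return np.maximum(x, 0)


def additive_relu_attention(Q, K, V, shift=0):
    """Hypothetical attention step using only add/subtract and ReLU.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Assumes V >= 0.
    """
    # Pairwise query-key differences, shape (n_q, n_k, d).
    diff = Q[:, None, :] - K[None, :, :]
    # Manhattan distance, with |x| written as relu(x) + relu(-x).
    Z = (relu(diff) + relu(-diff)).sum(axis=-1)
    # Assumed shift: keys closer than `shift` to the query are not penalized.
    Z = relu(Z - shift)
    # Distant keys "inhibit" their values: subtract the score, clip at zero,
    # then sum the surviving contributions over all keys.
    return relu(V[None, :, :] - Z[:, :, None]).sum(axis=1)


# Integer inputs stay integer throughout; no multiplication is performed.
Q = np.array([[1, 2], [5, 5]])
K = np.array([[1, 2], [5, 5]])
V = np.array([[3, 4], [10, 1]])
H = additive_relu_attention(Q, K, V)
print(H)  # [[ 6  4]
          #  [10  1]]
```

In this sketch a key that matches the query exactly passes its value through unchanged, while a distant key has its value reduced toward zero. Because every operation is an addition, subtraction, or comparison with zero, integer inputs never need a wider accumulator for products.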