Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Building on prior progress in RFA, we characterize this gap through the lens of control variates and show that RFA can be decomposed into a sum of multiple control variate estimators for each element in the sequence. This new framework reveals that exact softmax attention can be recovered from RFA by manipulating each control variate. It also allows us to develop a more flexible form of control variates, resulting in a novel attention mechanism that significantly reduces the approximation gap while maintaining linear complexity. Extensive experiments demonstrate that our model outperforms state-of-the-art efficient attention mechanisms on both vision and language tasks.
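To make the approximation gap concrete, the following is a minimal sketch of plain random-feature attention next to exact softmax attention. It is not the control-variate mechanism proposed here. It assumes the common positive random feature map exp(w·x − ‖x‖²/2), omits the usual 1/√d scaling, and uses hypothetical function names (`softmax_attention`, `positive_features`, `random_feature_attention`) and toy inputs chosen only for illustration.

```python
import math
import random


def softmax_attention(Q, K, V):
    """Exact softmax attention without 1/sqrt(d) scaling; quadratic in sequence length."""
    out = []
    for q in Q:
        scores = [math.exp(sum(a * b for a, b in zip(q, k))) for k in K]
        total = sum(scores)
        out.append([sum(s * v[j] for s, v in zip(scores, V)) / total
                    for j in range(len(V[0]))])
    return out


def positive_features(x, W):
    """Assumed feature map: phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m).

    For w_i ~ N(0, I), E[phi(q) . phi(k)] = exp(q . k).
    """
    half_sq_norm = 0.5 * sum(a * a for a in x)
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.exp(sum(a * b for a, b in zip(w, x)) - half_sq_norm)
            for w in W]


def random_feature_attention(Q, K, V, num_features=2048, seed=0):
    """Random-feature approximation of softmax attention; linear in sequence length."""
    rng = random.Random(seed)
    d, dv = len(Q[0]), len(V[0])
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]

    # Key/value summaries are accumulated once and shared by every query:
    # S[i][j] = sum_n phi(k_n)_i * v_n[j],  z[i] = sum_n phi(k_n)_i
    S = [[0.0] * dv for _ in range(num_features)]
    z = [0.0] * num_features
    for k, v in zip(K, V):
        for i, f in enumerate(positive_features(k, W)):
            z[i] += f
            for j in range(dv):
                S[i][j] += f * v[j]

    out = []
    for q in Q:
        phi_q = positive_features(q, W)
        denom = sum(f * zi for f, zi in zip(phi_q, z))
        out.append([sum(f * S[i][j] for i, f in enumerate(phi_q)) / denom
                    for j in range(dv)])
    return out


# Toy inputs (illustrative only): 6 tokens, 4-dimensional heads.
data_rng = random.Random(1)
n, d = 6, 4
Q = [[data_rng.uniform(-0.3, 0.3) for _ in range(d)] for _ in range(n)]
K = [[data_rng.uniform(-0.3, 0.3) for _ in range(d)] for _ in range(n)]
V = [[data_rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

exact = softmax_attention(Q, K, V)
approx = random_feature_attention(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(exact, approx) for a, b in zip(ra, rb))
print("max abs gap between RFA and softmax attention:", max_err)
```

The printed gap depends on the number of random features and on the norms of the queries and keys; in this sketch it shrinks as `num_features` grows, at the cost of more computation per token.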