Efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed with random features (RFs), which yield an unbiased approximation of the operator's result. Such operators arise in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs that approximate the Gaussian and softmax kernels. In contrast to traditional RF approximations, the parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods reduce variance in practice (by a factor of $e^{10}$ or more) and outperform previous methods in a kernel regression task. Using the proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
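For orientation, the sketch below shows the well-known non-parameterized positive random feature estimator of the softmax kernel $\exp(x^\top y)$. It is a baseline of the kind the abstract contrasts against, not the parameterized features proposed here. It relies on the identity $\mathbb{E}_{w \sim \mathcal{N}(0, I)}[\exp(w^\top x - \|x\|^2/2)\exp(w^\top y - \|y\|^2/2)] = \exp(x^\top y)$. The function names and the pure-Python setup are illustrative assumptions.

```python
import math
import random


def softmax_kernel(x, y):
    """Exact softmax kernel exp(<x, y>)."""
    return math.exp(sum(a * b for a, b in zip(x, y)))


def positive_random_features(x, ws):
    """Map x to len(ws) strictly positive features.

    Each feature is exp(<w, x> - ||x||^2 / 2) / sqrt(M), so the dot product
    of two feature vectors is an unbiased estimate of exp(<x, y>).
    """
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(ws))
    return [
        scale * math.exp(sum(wi * xi for wi, xi in zip(w, x)) - 0.5 * sq_norm)
        for w in ws
    ]


def estimate_softmax_kernel(x, y, num_features, rng):
    """Monte Carlo estimate of exp(<x, y>) with shared Gaussian projections."""
    d = len(x)
    # Hypothetical setup: i.i.d. standard normal projection vectors.
    ws = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
    phi_x = positive_random_features(x, ws)
    phi_y = positive_random_features(y, ws)
    return sum(a * b for a, b in zip(phi_x, phi_y))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.3, -0.2, 0.1]
    y = [0.1, 0.4, -0.3]
    print("exact:   ", softmax_kernel(x, y))
    print("estimate:", estimate_softmax_kernel(x, y, 20000, rng))
```

Because every feature is an exponential, the estimate is always positive, which matters when it is used as a normalizer in attention. The estimator has no free parameters to tune, whereas the abstract describes features whose parameters can be optimized in closed form to reduce variance.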