We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy because its resource requirements scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by attending only over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally accurate subsampling algorithm, randomly pivoted Cholesky, and weight its elements optimally to minimize reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
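To make the coreset idea concrete, the following is a minimal NumPy sketch, not the authors' GPU-optimized PyTorch implementation. It assumes the kernel being compressed is the exponential kernel $\exp(\langle x, y\rangle/\sqrt{d})$ on the keys, selects pivots with a textbook randomly pivoted Cholesky loop, and uses Nyström-style weights as one plausible reading of "optimal" weighting; the weighting used in WildCat itself may differ. The function names `exp_kernel`, `rp_cholesky_pivots`, `coreset_attention`, and `exact_attention` are hypothetical and chosen only for illustration.

```python
import numpy as np


def exp_kernel(X, Y, scale):
    """Exponential (softmax) kernel matrix with entries exp(scale * <x, y>)."""
    return np.exp(X @ Y.T * scale)


def rp_cholesky_pivots(keys, m, scale, rng):
    """Pick m coreset indices by randomly pivoted Cholesky on the key kernel.

    Only m kernel columns are formed, so the n x n kernel matrix is never built.
    """
    n = keys.shape[0]
    F = np.zeros((n, m))
    # Residual diagonal starts as the kernel diagonal exp(scale * ||k_i||^2).
    resid = np.exp(np.sum(keys * keys, axis=1) * scale)
    pivots = []
    for i in range(m):
        # Sample a pivot with probability proportional to the residual diagonal.
        s = rng.choice(n, p=resid / resid.sum())
        col = exp_kernel(keys, keys[s:s + 1], scale)[:, 0]
        g = col - F[:, :i] @ F[s, :i]          # residual kernel column at pivot s
        F[:, i] = g / np.sqrt(g[s])
        resid = np.maximum(resid - F[:, i] ** 2, 0.0)
        resid[s] = 0.0                          # never pick the same pivot twice
        pivots.append(int(s))
    return np.array(pivots)


def coreset_attention(Q, K, V, m, rng):
    """Approximate softmax attention by attending over m weighted coreset keys."""
    scale = 1.0 / np.sqrt(Q.shape[1])
    S = rp_cholesky_pivots(K, m, scale, rng)
    K_S = K[S]
    # Assumed Nystrom-style weights: W = K_SS^{-1} K_S, (shape m x n).
    W = np.linalg.solve(exp_kernel(K_S, K_S, scale), exp_kernel(K_S, K, scale))
    WV = W @ V            # weighted coreset values, shape (m, dv)
    w1 = W.sum(axis=1)    # weighted coreset normalizer, shape (m,)
    A = exp_kernel(Q, K_S, scale)  # queries only interact with coreset keys
    return (A @ WV) / (A @ w1)[:, None]


def exact_attention(Q, K, V):
    """Reference softmax attention; no max-subtraction since inputs are bounded."""
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    return (A @ V) / A.sum(axis=1, keepdims=True)


# Toy comparison on bounded random inputs (sizes are arbitrary illustrative choices).
rng = np.random.default_rng(0)
n, d, dv, m = 512, 4, 3, 64
Q = rng.uniform(-1.0, 1.0, (n, d))
K = rng.uniform(-1.0, 1.0, (n, d))
V = rng.standard_normal((n, dv))
exact = exact_attention(Q, K, V)
approx = coreset_attention(Q, K, V, m, rng)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"coreset size {m} of {n} keys, relative error {rel_err:.2e}")
```

In this sketch the approximate path costs on the order of $n m^2$ operations for $m$ coreset elements, whereas the reference path forms the full $n \times n$ score matrix. The toy comparison illustrates the mechanism only and says nothing about the error rates or runtimes claimed for WildCat.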