We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
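The coreset-selection step named above, randomly pivoted Cholesky, can be sketched in a few lines of NumPy. This is not the paper's released implementation — just a minimal, hedged sketch of the generic randomly pivoted Cholesky procedure applied to a PSD kernel matrix, where the function name `rp_cholesky` and the column-oracle interface `K_col` are our own choices for illustration. Each step samples a pivot with probability proportional to the residual diagonal, then performs a rank-one Schur-complement update, so the selected columns are spectrally representative of the full matrix.

```python
import numpy as np

def rp_cholesky(K_diag, K_col, k, seed=None):
    """Randomly pivoted Cholesky (illustrative sketch).

    Selects k landmark indices from an n x n PSD kernel matrix
    accessed only through its diagonal `K_diag` and a column oracle
    `K_col(i) -> i-th column`. Returns the pivot indices and a factor
    F such that K ~= F @ F.T (a rank-k Nystrom-type approximation).
    """
    rng = np.random.default_rng(seed)
    n = len(K_diag)
    d = np.asarray(K_diag, dtype=float).copy()  # residual diagonal
    F = np.zeros((n, k))
    pivots = []
    for t in range(k):
        # Sample the next pivot with probability proportional to the
        # residual diagonal (entries already picked have residual 0).
        p = np.clip(d, 0.0, None)
        i = rng.choice(n, p=p / p.sum())
        pivots.append(i)
        # Rank-one Schur-complement update using column i of K.
        g = K_col(i) - F[:, :t] @ F[i, :t]
        F[:, t] = g / np.sqrt(g[i])
        d -= F[:, t] ** 2
        d[i] = 0.0  # pivot i is now fully resolved
    return np.array(pivots), F
```

A useful sanity property of this scheme is that the resulting approximation `F @ F.T` reproduces the original matrix exactly on the selected pivot rows and columns, while the residual diagonal stays nonnegative; the paper's contribution is coupling such a sampler with optimal reweighting and a GPU kernel, which this sketch does not attempt.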