We present Neural Attention Search (NAtS), a framework that automatically evaluates the importance of each token within a sequence and determines whether that token can be dropped after a number of steps. This approach can efficiently reduce the KV cache size required by transformer-based models during inference and thus lower inference costs. In this paper, we design a search space containing three token types: (i) Global Tokens are preserved and queried by all subsequent tokens. (ii) Local Tokens survive only until the next global token appears. (iii) Sliding Window Tokens influence only a fixed-size window of subsequent tokens. Similar to the one-shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache size required by the models while maintaining their performance.
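The three token types induce a structured attention mask. The sketch below is a minimal, hypothetical illustration of such a mask (the function name, type encoding, and window size are our assumptions, not the paper's implementation; edge cases such as whether a local token is still visible to the global token that retires it are left ambiguous by the abstract and hedged in a comment):

```python
import numpy as np

# Assumed encoding of the three NAtS token types (illustrative only).
GLOBAL, LOCAL, SLIDING = 0, 1, 2

def nats_mask(token_types, window=2):
    """Boolean causal mask: mask[q, k] == True iff query q may attend to key k.

    - GLOBAL keys are visible to every later query.
    - LOCAL keys are visible until a global token appears after them
      (here: no GLOBAL strictly between positions k and q).
    - SLIDING keys are visible only within a fixed-size window.
    """
    n = len(token_types)
    mask = np.zeros((n, n), dtype=bool)
    for k in range(n):
        for q in range(k, n):  # causal: only queries at or after key position
            if token_types[k] == GLOBAL:
                mask[q, k] = True
            elif token_types[k] == LOCAL:
                # Dropped once a global token has appeared between k and q.
                # (Visibility to the global token itself is an assumption.)
                mask[q, k] = not any(
                    token_types[j] == GLOBAL for j in range(k + 1, q)
                )
            else:  # SLIDING
                mask[q, k] = q - k <= window
    return mask
```

In NAtS the token types themselves are not hand-assigned as above but learned jointly with the model weights through a differentiable attention mask, analogous to architecture weights in one-shot NAS.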