The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand for computational resources has hindered broader deployment and raised environmental concerns. A common strategy for reducing this demand is to cache the Key-Value (KV) states of the attention mechanism predominantly adopted by mainstream LLMs. Caching mitigates the need for repeated attention computation but incurs significant memory overhead. Current NLP practice often resorts to sparse attention, which, unfortunately, can lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we empirically analyze the attention weight distribution within code generation models and uncover a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information, and layer-wise anchor attention that enables cross-layer communication to mitigate the excessive superposition caused by compression. Extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which consistently achieves a significant (at least 70%) reduction in KV cache requirements while preserving most of the model's performance.
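The trade-off the abstract describes can be made concrete with a minimal sketch. AnchorCoder's actual anchor-attention mechanism is defined in the paper itself; the snippet below only illustrates the general idea of attending over a compressed KV cache, using a hypothetical top-k selection of high-attention-mass positions as a stand-in for anchor points (the fraction kept, 30 of 100, mirrors the claimed at-least-70% cache reduction).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, seq_len = 64, 100
K = rng.standard_normal((seq_len, d))  # cached key states
V = rng.standard_normal((seq_len, d))  # cached value states
q = rng.standard_normal(d)             # query for the next token

# Full KV cache: attend over all past positions.
w_full = softmax(q @ K.T / np.sqrt(d))
out_full = w_full @ V

# Hypothetical anchor selection (illustrative only, not AnchorCoder's
# method): keep the k positions carrying the most attention mass, so
# only their K/V entries need to stay in the cache.
k = 30
anchors = np.argsort(w_full)[-k:]
w_anchor = softmax(q @ K[anchors].T / np.sqrt(d))
out_anchor = w_anchor @ V[anchors]

reduction = 1 - k / seq_len
print(f"KV cache reduction: {reduction:.0%}")
```

The output of the attention step keeps the same shape while the cached K/V tensors shrink from 100 rows to 30; the open question the paper addresses is doing this without the accuracy loss that naive sparsification causes in code generation.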