Large language models (LLMs) predominantly employ decoder-only transformer architectures, which require caching the keys/values of historical tokens to provide context and avoid redundant computation. However, the large size and parameter count of these LLMs demand massive GPU memory, and this demand grows with the length of the input text, creating an urgent need for more efficient methods of information storage and processing. This study introduces the Anchor-based LLM (AnLLM), which combines an anchor-based self-attention network (AnSAN) with an anchor-based inference strategy. This approach enables LLMs to compress sequence information into an anchor token, reducing the keys/values cache and improving inference efficiency. Experiments show that AnLLM maintains comparable accuracy with up to 99% keys/values cache reduction and up to 3.5 times faster inference. Despite a minor compromise in accuracy, AnLLM significantly improves computational efficiency and resource utilization, demonstrating the potential of anchor-based attention for real-time LLM inference in practical applications.
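To make the cache-reduction idea concrete, the following is a minimal toy sketch, not the AnLLM implementation. It assumes that each sequence ends with a hypothetical anchor token (written `<AC>` here) and that, once a sequence has been processed, only the anchor's keys/values entry needs to stay in the cache. The names `ToyKVCache` and `run_anchor_inference` are illustrative, and the keys/values are placeholder strings rather than real tensors.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Toy keys/values cache holding one entry per cached token."""
    entries: list = field(default_factory=list)  # (token, key, value, is_anchor)

    def append(self, token, key, value, is_anchor=False):
        self.entries.append((token, key, value, is_anchor))

    def prune_to_anchors(self):
        """Drop every non-anchor entry, keeping only anchor keys/values."""
        self.entries = [e for e in self.entries if e[3]]

    def __len__(self):
        return len(self.entries)


def run_anchor_inference(sequences, anchor="<AC>"):
    """Process sequences in order, pruning the cache after each anchor token.

    Returns (peak cache size, final cache size, total tokens processed).
    """
    cache = ToyKVCache()
    peak = 0
    total = 0
    for seq in sequences:
        for tok in list(seq) + [anchor]:
            # Placeholder key/value: a real model would compute tensors here.
            cache.append(tok, key=f"k({tok})", value=f"v({tok})",
                         is_anchor=(tok == anchor))
            total += 1
            peak = max(peak, len(cache))
        # Assumption: the anchor now summarizes its sequence, so the
        # remaining entries of that sequence can be discarded.
        cache.prune_to_anchors()
    return peak, len(cache), total


sequences = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
peak, final, total = run_anchor_inference(sequences)
print(peak, final, total)  # prints: 7 3 12
```

In this toy run, a conventional cache would hold all 12 entries by the end, whereas the pruned cache peaks at 7 entries and finishes with 3 (one per anchor). The sketch covers only the cache bookkeeping; it does not model the attention computation or the training that would let an anchor token actually carry the information of its sequence.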