Transformer-based large language models (LLMs) struggle to process long sequences on edge devices due to the quadratic complexity of the attention mechanism and the growing memory footprint of the Key-Value (KV) cache. Existing KV cache optimizations suffer from irreversible token eviction in long-output tasks, while alternative sequence-modeling architectures are costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. The approach remains fully compatible with standard Transformer architectures, requires fine-tuning only a small fraction of the parameters, and enables selective activation of the memory-gating module to route between long- and short-context tasks. Experimental results show that EdgeInfinite matches the performance of baseline Transformer-based LLMs on long-context benchmarks while reducing memory consumption and time to first token.
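To make the abstract's central idea concrete, the following is a minimal NumPy sketch of a memory-gating module: a learned scalar gate blends standard local attention with a readout from a fixed-size compressed memory, in the style of linear-attention associative memories. All function names, shapes, and the specific gating formula here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_output(q, k, v, memory, norm, gate_logit):
    """Blend local attention with a compressed-memory readout via a
    learned gate (hypothetical sketch, not EdgeInfinite's exact math).

    q, k, v:    (seq, d) local-window projections
    memory:     (d, d)   fixed-size compressed memory matrix
    norm:       (d,)     positive normalization vector for the memory
    gate_logit: scalar   trainable gate parameter (pre-sigmoid)
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention over the local window.
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    # Linear-attention-style memory readout: sigma(q) @ memory,
    # normalized by sigma(q) @ norm, with sigma(x) = ELU(x) + 1 > 0.
    sigma_q = np.where(q > 0, q + 1.0, np.exp(q))
    mem_out = (sigma_q @ memory) / (sigma_q @ norm)[:, None]
    # Sigmoid gate decides how much the compressed memory contributes;
    # a near-zero gate recovers plain local attention (short-context path).
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return g * mem_out + (1.0 - g) * attn
```

With the gate driven to zero the module falls back to vanilla attention, which illustrates why such a design can stay compatible with a standard Transformer and be activated only for long-context inputs; the memory matrix stays constant-size regardless of sequence length, which is the source of the memory savings.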