Deploying Large Language Models (LLMs) in streaming applications that involve long contexts, particularly extended dialogues and text analysis, is important but presents two significant challenges. First, memory consumption during the decoding phase is substantial because the Key and Value states (KV) of previous tokens are cached. Second, attention computation is time-consuming, with a time complexity of $O(n^2)$ for the generation of each token. At the recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that supports 128K-long documents. In this paper, we focus on memory efficiency when the context length $n$ is much greater than 128K ($n \gg 2^d$). Consider a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$. The polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$ by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$, which speeds up the computation of ${\sf Attn}(Q, K, V)$ to $n^{1+o(1)}$ time. Even so, storing the Key and Value matrices $K, V \in \mathbb{R}^{n \times d}$ still requires $O(nd)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that reads the data in a single pass in streaming fashion. The method uses sublinear space $o(n)$ to store three sketch matrices, removing the need to store $K$ and $V$ exactly. Notably, our algorithm is especially memory-efficient for super-long token sequences: as the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This attribute underscores the potential of our technique for efficiently handling LLMs in streaming applications.
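To make the memory argument concrete, the following is a minimal sketch of one-pass attention approximation with a low-rank factorization, written in plain Python. It is not the algorithm of this paper and does not reproduce its three sketch matrices. As an illustrative assumption, it uses a degree-2 Taylor feature map $\phi$ with $\phi(q)^\top \phi(k) = 1 + q^\top k + (q^\top k)^2/2 \approx \exp(q^\top k)$, so that the rows of $U_1$ and $U_2$ would be $\phi(q_i)$ and $\phi(k_i)$ with $t = 1 + d + d^2$. The names `taylor_features`, `StreamingAttentionState`, and `exact_attention` are hypothetical, and the approximation is only accurate when $|q^\top k|$ is small.

```python
import math
import random


def taylor_features(x):
    """Assumed feature map: phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    d = len(x)
    quad = [x[i] * x[j] / math.sqrt(2.0) for i in range(d) for j in range(d)]
    return [1.0] + list(x) + quad


class StreamingAttentionState:
    """Fixed-size summary of every (key, value) pair seen so far.

    Stores a t x d matrix and a length-t vector with t = 1 + d + d^2,
    so the memory footprint does not depend on the stream length n.
    """

    def __init__(self, d):
        t = 1 + d + d * d
        self.kv = [[0.0] * d for _ in range(t)]  # sum_i phi(k_i) v_i^T
        self.norm = [0.0] * t                    # sum_i phi(k_i)

    def update(self, key, value):
        """Consume one (key, value) pair; the pair can be discarded afterwards."""
        for a, p in enumerate(taylor_features(key)):
            self.norm[a] += p
            row = self.kv[a]
            for j, v in enumerate(value):
                row[j] += p * v

    def attend(self, query):
        """Approximate the softmax-weighted average of the values for one query."""
        phi = taylor_features(query)
        denom = sum(p * z for p, z in zip(phi, self.norm))
        d = len(self.kv[0])
        return [sum(p * row[j] for p, row in zip(phi, self.kv)) / denom
                for j in range(d)]


def exact_attention(query, keys, values):
    """Reference computation that keeps all keys and values in memory."""
    weights = [math.exp(sum(q * k for q, k in zip(query, key))) for key in keys]
    total = sum(weights)
    d = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d)]


if __name__ == "__main__":
    random.seed(0)
    d, n = 4, 2000
    state = StreamingAttentionState(d)
    keys, values = [], []  # kept only to compare against the exact result
    for _ in range(n):
        k = [random.uniform(-0.2, 0.2) for _ in range(d)]
        v = [random.uniform(-1.0, 1.0) for _ in range(d)]
        state.update(k, v)
        keys.append(k)
        values.append(v)
    q = [0.1, -0.2, 0.05, 0.15]
    err = max(abs(a - e) for a, e in
              zip(state.attend(q), exact_attention(q, keys, values)))
    print("max abs error:", err)
```

In this sketch the state holds $(1 + d + d^2)(d + 1)$ numbers regardless of $n$, which mirrors the motivation for replacing the $O(nd)$ KV cache with a compact summary. A higher-degree polynomial would be needed for larger $|q^\top k|$, at the cost of a larger $t$.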