Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA). Our experiments with four long-range tasks and several LLMs show that TOVA outperforms several baseline compression policies. Notably, our results are nearly on par with those of the full model, using in some cases only $\frac{1}{8}$ of the original cache size, which translates to 4.8X higher throughput. Our results shed light on the connection between transformers and RNNs, and help mitigate one of LLMs' most painful computational bottlenecks - the size of their key-value cache. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA
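The eviction idea behind such an attention-based compression policy can be sketched as follows. This is a minimal, hedged single-head illustration, not the paper's implementation: when the key-value cache exceeds a fixed budget, the cached token receiving the lowest attention weight from the current query is dropped. Details such as multi-head aggregation and per-layer handling are omitted here.

```python
import numpy as np

def evict_by_attention(keys, values, query, cache_size):
    """Simplified single-head sketch of an attention-based eviction step:
    if the cache exceeds the budget, drop the cached token that receives
    the lowest softmax attention weight from the current query."""
    if keys.shape[0] <= cache_size:
        return keys, values
    d = keys.shape[1]
    scores = keys @ query / np.sqrt(d)           # (n,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention weights
    keep = np.delete(np.arange(keys.shape[0]), int(weights.argmin()))
    return keys[keep], values[keep]

# Toy demo: a cache of 5 tokens compressed to a budget of 4.
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
q = rng.normal(size=8)
K2, V2 = evict_by_attention(K, V, q, cache_size=4)
```

Applied at every decoding step, this keeps the cache at a constant size, which is what makes the model behave like a bounded multi-state RNN.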