Algorithm and Hardness for Dynamic Attention Maintenance in Large Language Models

Large language models (LLMs) have brought fundamental changes to human life. The attention scheme is one of the key components of all LLMs, such as BERT, GPT-1, Transformers, GPT-2, 3, 3.5 and 4. Inspired by previous theoretical studies of the static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi arXiv 2023, Alman and Song arXiv 2023], in this work we formally define a dynamic version of the attention matrix multiplication problem. There are matrices $Q, K, V \in \mathbb{R}^{n \times d}$, which represent the query, key and value in LLMs. In each iteration we update one entry in $K$ or $V$. In the query stage, we receive $(i,j) \in [n] \times [d]$ as input and want to answer $(D^{-1} A V)_{i,j}$, where $A := \exp(QK^\top) \in \mathbb{R}^{n \times n}$ is a square matrix and $D := \mathrm{diag}(A {\bf 1}_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Here ${\bf 1}_n$ denotes the length-$n$ vector whose entries are all ones. We provide two results: an algorithm and a conditional lower bound. $\bullet$ On one hand, inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000, Sankowski FOCS 2004, Cohen, Lee and Song STOC 2019, Brand SODA 2020], we provide a data structure with $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time. $\bullet$ On the other hand, we show that unless the hinted matrix vector multiplication conjecture [Brand, Nanongkai and Saranurak FOCS 2019] is false, there is no algorithm that achieves both $O(n^{\omega(1,1,\tau) - \tau- \Omega(1)})$ amortized update time and $O(n^{1+\tau-\Omega(1)})$ worst-case query time. In conclusion, our algorithmic result is conditionally optimal unless the hinted matrix vector multiplication conjecture is false.
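
To make the problem statement concrete, here is a minimal Python sketch of a naive baseline for the update/query interface described in the abstract. It is not the data structure from the paper: it performs no lazy updates or fast matrix multiplication, and simply recomputes row $i$ of $A$ on every query, so an update costs $O(1)$ and a query costs $O(nd)$. The class name `NaiveDynamicAttention` and its method names are hypothetical, indices are 0-based rather than drawn from $[n] \times [d]$, and $\exp$ is assumed to be applied entrywise.

```python
import math


class NaiveDynamicAttention:
    """Hypothetical baseline: O(1) entry updates, O(n*d) work per query."""

    def __init__(self, Q, K, V):
        # Copy the n x d inputs so later updates do not mutate the caller's lists.
        self.Q = [row[:] for row in Q]
        self.K = [row[:] for row in K]
        self.V = [row[:] for row in V]

    def update_k(self, i, j, value):
        """Overwrite one entry of K."""
        self.K[i][j] = value

    def update_v(self, i, j, value):
        """Overwrite one entry of V."""
        self.V[i][j] = value

    def query(self, i, j):
        """Return (D^{-1} A V)[i][j] with A = exp(Q K^T) taken entrywise."""
        q_i = self.Q[i]
        # Row i of A: A[i][l] = exp(<Q[i], K[l]>).
        a_row = [
            math.exp(sum(q * k for q, k in zip(q_i, k_row)))
            for k_row in self.K
        ]
        # D[i][i] is the sum of row i of A, i.e. (A 1_n)[i].
        denom = sum(a_row)
        # (A V)[i][j] is the dot product of row i of A with column j of V.
        numer = sum(a * v_row[j] for a, v_row in zip(a_row, self.V))
        return numer / denom
```

With $Q$ set to all zeros, every entry of $A$ equals 1, so a query returns the mean of column $j$ of $V$, which gives a quick sanity check. The trade-off stated in the abstract sits between this baseline and full recomputation after every update.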

