Large language models (LLMs) have transformed many aspects of daily life. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we give a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn. For $B \in \mathbb{R}^{n \times d}$, $b_{j_0,i_0} \in \mathbb{R}$ denotes the entry in the $j_0$-th row and $i_0$-th column of $B$, $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1= A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$, and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to $\epsilon$ and runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega\approx 2.37$ denotes the exponent of matrix multiplication.
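To make the objective concrete, the following is a minimal NumPy sketch that evaluates $L(X,Y)$ in two ways: literally, block by block from the Kronecker product, and in an equivalent matrix form. It is an illustration only, not the training algorithm of this work. It assumes a row-major vectorization of $X$ (the abstract does not fix a convention), under which $\mathsf{A}_{j_0} x$ equals the $j_0$-th row of $A_1 X A_2^\top$, so the loss reduces to the squared Frobenius norm of $\mathrm{softmax}(A_1 X A_2^\top) A_3 Y - B$ with a row-wise softmax. The function names `attention_loss` and `attention_loss_blockwise` are hypothetical.

```python
import numpy as np


def attention_loss(A1, A2, A3, B, X, Y):
    """Matrix form: || softmax_rows(A1 X A2^T) (A3 Y) - B ||_F^2."""
    S = A1 @ X @ A2.T                      # n x n score matrix
    S = S - S.max(axis=1, keepdims=True)   # row shift for stability; softmax is unchanged
    E = np.exp(S)
    P = E / E.sum(axis=1, keepdims=True)   # each row sums to 1
    R = P @ (A3 @ Y) - B                   # n x d residual
    return float(np.sum(R ** 2))


def attention_loss_blockwise(A1, A2, A3, B, X, Y):
    """Literal double sum over (j0, i0) using the Kronecker blocks A_{j0}."""
    n, d = A1.shape
    A = np.kron(A1, A2)                    # shape (n^2, d^2)
    x = X.reshape(-1)                      # assumption: row-major vectorization of X
    total = 0.0
    for j0 in range(n):
        A_j0 = A[j0 * n:(j0 + 1) * n, :]   # j0-th block, shape (n, d^2)
        u = np.exp(A_j0 @ x)
        f = u / u.sum()                    # <exp(A_j0 x), 1_n>^{-1} exp(A_j0 x)
        for i0 in range(d):
            total += (f @ (A3 @ Y[:, i0]) - B[j0, i0]) ** 2
    return float(total)
```

The blockwise version materializes the $n^2 \times d^2$ matrix $\mathsf{A}$ and is only practical for small $n$ and $d$; the matrix form avoids this and uses only products of the shapes that appear in the stated running time.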