We present a spatially efficient decomposition of matrices and arbitrary-order tensors as linear combinations of tensor products of $\{-1, 1\}$-valued vectors. For any matrix $A \in \mathbb{R}^{m \times n}$, $$A - R_w = S_w C_w T_w^\top = \sum_{j=1}^w c_j \cdot \mathbf{s}_j \mathbf{t}_j^\top$$ is a {\it $w$-width signed cut decomposition of $A$}. Here $C_w = \mathrm{diag}(\mathbf{c}_w)$ for some $\mathbf{c}_w \in \mathbb{R}^w$, and $S_w$, $T_w$, and the vectors $\mathbf{s}_j$, $\mathbf{t}_j$ are $\{-1, 1\}$-valued. Storing $(S_w, T_w, C_w)$ requires only $w \cdot (m + n)$ packed bits and $w$ floating-point numbers. As a function of $w$, $\|R_w\|_F$ exhibits exponential decay when applied to \textit{f32} matrices with i.i.d. $\mathcal{N}(0, 1)$ entries. Choosing $w$ so that $(S_w, T_w, C_w)$ has the same memory footprint as an \textit{f16} or \textit{bf16} matrix, the relative error is comparable to that of the reduced-precision formats. Our algorithm computes efficient signed cut decompositions in $20$ lines of pseudocode; it is a simple modification of a celebrated 1999 algorithm of Frieze and Kannan [1]. As a first application, we approximate the weight matrices in the open \textit{Mistral-7B-v0.1} Large Language Model at $50\%$ spatial compression. Remarkably, all $226$ remainder matrices have a relative error below $6\%$, and the expanded model closely matches \textit{Mistral-7B-v0.1} on the {\it huggingface} leaderboard [2]. Benchmark performance degrades slowly as we tighten the spatial compression from $50\%$ to $25\%$. We optimize our open-source \textit{rust} implementation [3] with \textit{simd} instructions on \textit{avx2} and \textit{avx512} architectures. We also extend our algorithm from matrices to tensors of arbitrary order and use it to compress a picture of the first author's cat Angus.
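A greedy construction of such a decomposition can be sketched as follows. This is a minimal illustrative sketch, not the paper's $20$-line algorithm or its optimized \textit{rust} implementation [3]: it assumes an alternating-maximization heuristic, fixing $\mathbf{t}_j$ to update $\mathbf{s}_j = \mathrm{sign}(R\,\mathbf{t}_j)$ and vice versa, then peeling off the coefficient $c_j = \mathbf{s}_j^\top R\, \mathbf{t}_j / (mn)$ that minimizes $\|R - c_j\, \mathbf{s}_j \mathbf{t}_j^\top\|_F$.

```rust
/// Greedy w-width signed cut decomposition (illustrative sketch).
/// Returns the sign vectors s_j, t_j and coefficients c_j such that
/// A ≈ Σ_j c_j · s_j t_j^T, with residual R_w left implicit.
fn signed_cut_decompose(a: &[Vec<f64>], w: usize) -> (Vec<Vec<i8>>, Vec<Vec<i8>>, Vec<f64>) {
    let (m, n) = (a.len(), a[0].len());
    let mut r: Vec<Vec<f64>> = a.to_vec(); // residual, updated in place
    let (mut ss, mut ts, mut cs) = (Vec::new(), Vec::new(), Vec::new());
    for _ in 0..w {
        let mut s = vec![1i8; m];
        let mut t = vec![1i8; n];
        // Alternating maximization of s^T R t over sign vectors;
        // a fixed iteration count stands in for a convergence test.
        for _ in 0..16 {
            for i in 0..m {
                let dot: f64 = (0..n).map(|j| r[i][j] * t[j] as f64).sum();
                s[i] = if dot >= 0.0 { 1 } else { -1 };
            }
            for j in 0..n {
                let dot: f64 = (0..m).map(|i| r[i][j] * s[i] as f64).sum();
                t[j] = if dot >= 0.0 { 1 } else { -1 };
            }
        }
        // Optimal coefficient c = s^T R t / (mn) for this sign pattern.
        let mut acc = 0.0;
        for i in 0..m {
            for j in 0..n {
                acc += r[i][j] * (s[i] * t[j]) as f64;
            }
        }
        let c = acc / (m * n) as f64;
        // Peel the rank-one signed cut off the residual.
        for i in 0..m {
            for j in 0..n {
                r[i][j] -= c * (s[i] * t[j]) as f64;
            }
        }
        ss.push(s);
        ts.push(t);
        cs.push(c);
    }
    (ss, ts, cs)
}
```

Because each $c_j$ is the least-squares coefficient for the fixed sign pattern $\mathbf{s}_j \mathbf{t}_j^\top$, every cut reduces $\|R\|_F^2$ by exactly $c_j^2 \, mn$, so the residual norm is non-increasing in $w$.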