A more accurate rational non-commutative algorithm for multiplying 4x4 matrices using 48 multiplications

We propose a more accurate variant of an algorithm for multiplying 4x4 matrices using 48 multiplications over any ring containing an inverse of 2. This algorithm has an error bound exponent of only $\log_4 \gamma_{\infty,2} \approx 2.386$. It also reaches a better accuracy w.r.t. max-norm in practice, when compared to previously known such fast algorithms. Furthermore, we propose a straight line program of this algorithm, giving a leading constant in its complexity bound of $\frac{387}{32} n^{2+\log_4 3} + o\left(n^{2+\log_4 3}\right)$ operations over any ring containing an inverse of 2.

Introduction: An algorithm to multiply two 4x4 complex-valued matrices requiring only 48 non-commutative multiplications was introduced in [16] using a pipeline of large language models orchestrated by an evolutionary coding agent. A matrix multiplication algorithm with that many non-commutative multiplications is denoted by $\langle 4\times 4\times 4{:}48\rangle$ in the sequel. An equivalent variant of the associated tensor decomposition defining this algorithm, but over the rationals (more precisely over any ring containing an inverse of 2), was then given in [8].

Most error analyses of sub-cubic time matrix multiplication algorithms [3, 4, 2, 1, 17] are given in the max-norm setting: they bound the largest output error as a function of the product of the max-norms of the vectors of input matrix coefficients. In this setting, Strassen's algorithm has shown the best accuracy bound (proven minimal under some assumptions in [2]). In [6, 8], the authors relaxed this setting by shifting the focus to the 2-norm for input and/or output; that allowed them to propose a $\langle 2\times 2\times 2{:}7\rangle$ variant with an improved accuracy bound. Experiments show that this variant performs best even when measuring the max-norm of the error bound.

We present in this note a variant of the recent $\langle 4\times 4\times 4{:}48\rangle$ algorithm over the rationals (again in the same orbit under De Groot isotropies [10]) that is more numerically accurate w.r.t. max-norm in practice.
In particular, our new variant improves on the error bound exponent, from $\log_2 \gamma_{\infty,2} \approx 2.577$ to the $\approx 2.386$ stated above.

Consider the product of an $M \times K$ matrix $A$ by a $K \times N$ matrix $B$. It is computed by an $\langle m, k, n\rangle$ algorithm represented by the matrices $L$, $R$, $P$ applied recursively on $\ell$ recursive levels, and the resulting $m_0 \times k_0$ by $k_0 \times n_0$ products are performed using an algorithm $\beta$. Here $M = m_0 m^{\ell}$, $K = k_0 k^{\ell}$ and $N = n_0 n^{\ell}$. The accuracy bound below uses any (possibly different) $p$-norms and $q$-norms for its left-hand side, $\|\cdot\|_p$, and right-hand side, $\|\cdot\|_q$. The associated dual norms are denoted by $\|\cdot\|_{p^\star}$ and $\|\cdot\|_{q^\star}$ respectively. Note that these are vector norms, hence $\|A\|_p$ for a matrix $A$ in $\mathbb{R}^{m\times n}$ denotes $\|\mathrm{Vect}(A)\|_p$ and is the $p$-norm of the $mn$-dimensional vector of its coefficients, and not a matrix norm.
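To make the notation above concrete, the following minimal Python sketch computes the vector $p$-norm of a matrix (the norm of its flattened coefficients, not an operator norm), the exponent of the associated dual norm, and the dimensions $M$, $K$, $N$ reached after $\ell$ recursive levels. The function names `vect_norm`, `dual_exponent` and `full_dims` are illustrative assumptions and do not come from the paper; the dual exponent is taken to be the usual Hölder conjugate, $1/p + 1/p^\star = 1$.

```python
import math

def vect_norm(A, p):
    """p-norm of Vect(A): the vector of all coefficients of A (a list of rows)."""
    coeffs = [abs(x) for row in A for x in row]
    if p == math.inf:
        return max(coeffs)  # max-norm: largest coefficient in absolute value
    return sum(c ** p for c in coeffs) ** (1.0 / p)

def dual_exponent(p):
    """Hoelder conjugate p* such that 1/p + 1/p* = 1 (1 and infinity are dual)."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def full_dims(m0, k0, n0, m, k, n, levels):
    """Dimensions (M, K, N) handled by `levels` recursive applications of an
    <m, k, n> algorithm on top of m0 x k0 by k0 x n0 base-case products."""
    return m0 * m ** levels, k0 * k ** levels, n0 * n ** levels

A = [[3.0, -4.0],
     [0.0, 0.0]]
print(vect_norm(A, 2))         # 2-norm of the 4 coefficients
print(vect_norm(A, math.inf))  # max-norm
print(dual_exponent(math.inf))
print(full_dims(1, 1, 1, 4, 4, 4, 3))
```

With this convention the 2-norm of a matrix is what is often called its Frobenius norm, and the max-norm is its largest coefficient in absolute value.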
