Low-Bandwidth Matrix Multiplication: Faster Algorithms and More General Forms of Sparsity

In prior work, Gupta et al. (SPAA 2022) presented a distributed algorithm for multiplying sparse $n \times n$ matrices, using $n$ computers. They assumed that the input matrices are uniformly sparse -- there are at most $d$ non-zeros in each row and column -- and the task is to compute a uniformly sparse part of the product matrix. Initially each computer knows one row of each input matrix, and eventually each computer needs to know one row of the product matrix. In each communication round each computer can send and receive one $O(\log n)$-bit message. Their algorithm solves this task in $O(d^{1.907})$ rounds, while the trivial bound is $O(d^2)$. We improve on the prior work in two dimensions: First, we show that we can solve the same task faster, in only $O(d^{1.832})$ rounds. Second, we explore what happens when matrices are not uniformly sparse. We consider the following alternative notions of sparsity: row-sparse matrices (at most $d$ non-zeros per row), column-sparse matrices, matrices with bounded degeneracy (we can recursively delete a row or column with at most $d$ non-zeros), average-sparse matrices (at most $dn$ non-zeros in total), and general matrices. We show that we can still compute $X = AB$ in $O(d^{1.832})$ rounds even if one of the three matrices ($A$, $B$, or $X$) is average-sparse instead of uniformly sparse. We present algorithms that handle a much broader range of sparsity in $O(d^2 + \log n)$ rounds, and present conditional hardness results that put limits on further improvements and generalizations.
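
The sparsity notions listed above can be made concrete with a small sketch. It is only an illustration, not code from the paper: the representation (a matrix as a set of `(row, column)` positions of its non-zeros) and all function names are assumptions chosen for this example. The degeneracy check follows the abstract's description by repeatedly deleting every row or column that currently holds at most $d$ non-zeros; deleting greedily is safe because a deletion can only lower the remaining counts.

```python
from collections import Counter

# Hypothetical representation: a matrix is a set of (row, col) positions
# of its non-zero entries; n is the matrix dimension.

def is_row_sparse(nonzeros, d):
    """At most d non-zeros in every row."""
    rows = Counter(i for i, _ in nonzeros)
    return all(c <= d for c in rows.values())

def is_column_sparse(nonzeros, d):
    """At most d non-zeros in every column."""
    cols = Counter(j for _, j in nonzeros)
    return all(c <= d for c in cols.values())

def is_uniformly_sparse(nonzeros, d):
    """At most d non-zeros in every row and in every column."""
    return is_row_sparse(nonzeros, d) and is_column_sparse(nonzeros, d)

def is_average_sparse(nonzeros, n, d):
    """At most d * n non-zeros in total."""
    return len(nonzeros) <= d * n

def has_bounded_degeneracy(nonzeros, d):
    """Can all non-zeros be removed by repeatedly deleting a row or
    column that currently has at most d non-zeros?"""
    remaining = set(nonzeros)
    while remaining:
        rows = Counter(i for i, _ in remaining)
        cols = Counter(j for _, j in remaining)
        light_rows = {i for i, c in rows.items() if c <= d}
        light_cols = {j for j, c in cols.items() if c <= d}
        if not light_rows and not light_cols:
            return False  # every remaining row and column is too heavy
        remaining = {(i, j) for (i, j) in remaining
                     if i not in light_rows and j not in light_cols}
    return True

# Example: a 4 x 4 "arrow" matrix with a dense first row and first column.
n = 4
arrow = {(0, j) for j in range(n)} | {(i, 0) for i in range(n)}

print(is_uniformly_sparse(arrow, 1))     # False: row 0 has 4 non-zeros
print(has_bounded_degeneracy(arrow, 1))  # True: peel rows/columns 1..3 first
print(is_average_sparse(arrow, n, 2))    # True: 7 non-zeros <= 2 * 4
```

The arrow matrix shows why the more general notions matter: a single dense row and column rule out uniform sparsity for small $d$, yet the matrix still has degeneracy 1 and few non-zeros overall.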
