We study the problem of computing matrix chain multiplications in a distributed computing cluster. In such systems, performance is often limited by the straggler problem, in which the slowest worker dominates the overall computation latency. To address this issue, several coded computing strategies have been proposed, primarily focusing on the simplest case: the multiplication of two matrices. These approaches alleviate the straggler effect, but at the expense of higher computational complexity and increased storage requirements at the workers. In many real-world applications, however, computations naturally involve long chains of matrix multiplications rather than a single two-matrix product. Extending univariate polynomial coding to this setting has been shown to amplify these costs: both computation and storage overheads grow significantly, limiting scalability. In this work, we propose two novel multivariate polynomial coding schemes designed specifically for matrix chain multiplication in distributed environments. Our results show that, while multivariate codes introduce additional computational cost at the workers, they can dramatically reduce storage overhead compared to univariate extensions. This reveals a fundamental trade-off between computation and storage efficiency and highlights the potential of multivariate codes as a practical solution for large-scale distributed linear algebra tasks.
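For readers unfamiliar with the two-matrix baseline mentioned above, the following is a minimal sketch of univariate polynomial-coded matrix multiplication with straggler tolerance. It is not one of the schemes proposed in this work; it illustrates only the standard single-product setting. The function names (`encode`, `coded_matmul`), the partition parameters `p` and `q`, the choice of evaluation points, and the simulated straggler set are all assumptions made for illustration.

```python
import numpy as np

def encode(blocks, x, stride):
    """Evaluate the matrix polynomial sum_i blocks[i] * x**(i*stride) at x."""
    return sum(blk * x ** (i * stride) for i, blk in enumerate(blocks))

def coded_matmul(A, B, p, q, n_workers, stragglers):
    """Recover A @ B from any p*q non-straggling workers (illustrative sketch)."""
    A_blocks = np.split(A, p, axis=0)   # p row blocks of A
    B_blocks = np.split(B, q, axis=1)   # q column blocks of B
    # Distinct evaluation points, one per worker (hypothetical choice).
    xs = np.linspace(-1.0, 1.0, n_workers)

    # Each worker stores one coded block of A and one of B and multiplies them.
    # Stragglers are simulated by simply skipping their results.
    results = {
        k: encode(A_blocks, x, 1) @ encode(B_blocks, x, p)
        for k, x in enumerate(xs)
        if k not in stragglers
    }
    if len(results) < p * q:
        raise ValueError("not enough responsive workers to decode")

    # The product polynomial has degree p*q - 1, so p*q evaluations suffice.
    survivors = sorted(results)[: p * q]
    V = np.vander(xs[survivors], p * q, increasing=True)
    Y = np.stack([results[k].ravel() for k in survivors])
    coeffs = np.linalg.solve(V, Y)      # interpolate the coefficients

    # The coefficient of x**(i + p*j) equals A_blocks[i] @ B_blocks[j].
    r, c = A_blocks[0].shape[0], B_blocks[0].shape[1]
    rows = [
        np.hstack([coeffs[i + p * j].reshape(r, c) for j in range(q)])
        for i in range(p)
    ]
    return np.vstack(rows)

# Usage: 6 workers, 2 of them stragglers, recovery threshold p*q = 4.
A = np.arange(24, dtype=float).reshape(4, 6)
B = np.arange(24, dtype=float).reshape(6, 4)
C = coded_matmul(A, B, p=2, q=2, n_workers=6, stragglers={1, 4})
```

In this sketch each worker holds a 1/p fraction of A and a 1/q fraction of B, and the product is recoverable from any p*q of the workers' results. Extending such a univariate construction to a chain of more than two matrices is where, as noted above, the computation and storage overheads grow.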