General Coded Computing in a Probabilistic Straggler Regime

Coded computing has shown promise in making distributed computing systems resilient to stragglers. However, most coded computing schemes are designed for exact computation, which requires the number of responding servers to exceed a certain recovery threshold. These schemes are also tailored to highly structured functions. Recently, new coded computing schemes have emerged for general computing functions, in which exact computation is replaced with approximate computation. In these schemes, each additional returned result yields a more accurate estimate of the computational tasks. This flexibility raises new questions. This paper addresses a practically important scenario in general coded computing, in which each server may become a straggler with probability $p$, independently of the others. We theoretically analyze the approximation error of two existing general coded computing schemes: Berrut Approximate Coded Computing (BACC) and Learning Theoretic Coded Computing (LeTCC). Under the probabilistic straggler configuration, we show that the average approximation errors of BACC and LeTCC converge to zero at rates of at least $\mathcal{O}(\log^3_{\frac{1}{p}}(N)\cdot{N^{-3}})$ and $\mathcal{O}(\log^4_{\frac{1}{p}}(N)\cdot{N^{-2}})$, respectively. This is perhaps surprising, as earlier results do not indicate convergence when the number of stragglers scales with the total number of servers $N$. In this setting, however, even though the average number of stragglers is $Np$, the independence with which servers become stragglers allows the approximation error to converge to zero. These theoretical results are validated through experiments on various computing functions, including deep neural networks.
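The probabilistic straggler model can be explored with a short simulation. The sketch below is illustrative only and does not implement BACC or LeTCC: it draws independent straggler events with probability $p$ and reports the average number of stragglers alongside the average length of the longest run of consecutive stragglers. The function names (`straggler_mask`, `longest_straggler_run`, `simulate`) and the choice to look at consecutive runs are assumptions made for illustration. It is a standard fact about independent Bernoulli trials that the longest run grows like $\log_{\frac{1}{p}}(N)$; relating this to the logarithmic factors in the bounds above is an interpretation, not a claim made in the abstract.

```python
import math
import random


def straggler_mask(n_servers, p, rng):
    """Return a list of booleans; True means that server straggled.

    Each server straggles independently with probability p.
    """
    return [rng.random() < p for _ in range(n_servers)]


def longest_straggler_run(mask):
    """Length of the longest block of consecutive stragglers in mask."""
    longest = current = 0
    for is_straggler in mask:
        current = current + 1 if is_straggler else 0
        longest = max(longest, current)
    return longest


def simulate(n_servers, p, trials=200, seed=0):
    """Average straggler count and average longest run over several trials."""
    rng = random.Random(seed)
    total_stragglers = 0
    total_longest = 0
    for _ in range(trials):
        mask = straggler_mask(n_servers, p, rng)
        total_stragglers += sum(mask)
        total_longest += longest_straggler_run(mask)
    return total_stragglers / trials, total_longest / trials


if __name__ == "__main__":
    p = 0.3  # hypothetical straggler probability
    for n in (100, 1000, 10000):
        mean_count, mean_run = simulate(n, p)
        # Columns: N, mean stragglers, mean longest run, log_{1/p}(N)
        print(n, round(mean_count, 1), round(mean_run, 2),
              round(math.log(n, 1 / p), 2))
```

Running the script prints, for each $N$, the average straggler count (which grows linearly, close to $Np$) next to the average longest run (which grows only logarithmically in $N$), so the two scalings can be compared directly.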
