General Coded Computing in a Probabilistic Straggler Regime

Coded computing has demonstrated promising results in addressing straggler resiliency in distributed computing systems. However, most coded computing schemes are designed for exact computation, requiring the number of responding servers to exceed a certain recovery threshold. Additionally, these schemes are tailored to highly structured functions. Recently, new coded computing schemes have emerged for general computing functions, in which exact computation is replaced with approximate computation. In these schemes, each additional result that becomes available yields a more accurate estimate of the computational tasks. This flexibility raises new questions. This paper studies a practically important scenario in general coded computing, in which each server becomes a straggler with probability $p$, independently of the others. We theoretically analyze the approximation error of two existing general coded computing schemes: Berrut Approximate Coded Computing (BACC) and Learning Theoretic Coded Computing (LeTCC). Under the probabilistic straggler configuration, we demonstrate that the average approximation errors of BACC and LeTCC converge to zero at rates of at least $\mathcal{O}(\log^3_{\frac{1}{p}}(N)\cdot{N^{-3}})$ and $\mathcal{O}(\log^4_{\frac{1}{p}}(N)\cdot{N^{-2}})$, respectively. This is perhaps surprising, as earlier results do not indicate convergence when the number of stragglers scales with the total number of servers $N$. In this setting, however, although the expected number of stragglers is $Np$, the independence of the servers in becoming stragglers allows the approximation error to converge to zero. These theoretical results are validated through experiments on various computing functions, including deep neural networks.
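To make the straggler model and the stated rates concrete, the following minimal Python sketch samples independent stragglers with probability $p$ and evaluates the two rate expressions from the abstract. It is an illustration only: the function names are hypothetical, the constants hidden by the $\mathcal{O}(\cdot)$ notation are dropped (so the printed values indicate scaling, not actual errors), and no part of BACC or LeTCC itself is implemented.

```python
import math
import random


def sample_stragglers(n, p, rng):
    """Return n booleans; True marks a server that straggles (independent, prob. p)."""
    return [rng.random() < p for _ in range(n)]


def mean_straggler_count(n, p, trials, rng):
    """Empirical average number of stragglers over `trials` runs (expected: n * p)."""
    total = sum(sum(sample_stragglers(n, p, rng)) for _ in range(trials))
    return total / trials


def bacc_rate(n, p):
    """Rate expression log_{1/p}(N)^3 * N^{-3} quoted for BACC, constants dropped."""
    return math.log(n, 1.0 / p) ** 3 * n ** -3


def letcc_rate(n, p):
    """Rate expression log_{1/p}(N)^4 * N^{-2} quoted for LeTCC, constants dropped."""
    return math.log(n, 1.0 / p) ** 4 * n ** -2


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
    p = 0.2                 # assumed straggler probability, for illustration
    for n in (100, 1000, 10000):
        avg = mean_straggler_count(n, p, trials=200, rng=rng)
        print(f"N={n:6d}  avg stragglers={avg:9.1f}  (Np={n * p:.0f})  "
              f"BACC rate={bacc_rate(n, p):.3e}  LeTCC rate={letcc_rate(n, p):.3e}")
```

In this sketch the average straggler count grows linearly with $N$, while both rate expressions shrink as $N$ increases, which is the contrast the abstract highlights.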
