Coded Computing for Resilient Distributed Computing: A Learning-Theoretic Framework

Coded computing has emerged as a promising framework for tackling significant challenges in large-scale distributed computing, including the presence of slow, faulty, or compromised servers. In this approach, each worker node processes a combination of the data rather than the raw data itself, and the final result is then decoded from the collective outputs of the worker nodes. However, there is a significant gap between current coded computing approaches and the broader landscape of general distributed computing, particularly when it comes to machine learning workloads. To bridge this gap, we propose a novel foundation for coded computing that integrates the principles of learning theory, and we develop a framework that adapts seamlessly to machine learning applications. In this framework, the objective is to find the encoder and decoder functions that minimize the loss function, defined as the mean squared error between the estimated and true values. To facilitate the search for the optimal encoding and decoding functions, we show that the loss function can be upper-bounded by the sum of two terms: the generalization error of the decoding function and the training error of the encoding function. Focusing on the second-order Sobolev space, we then derive the optimal encoder and decoder. We show that in the proposed solution, the mean squared error of the estimation decays at a rate of $\mathcal{O}(S^3 N^{-3})$ and $\mathcal{O}(S^{\frac{8}{5}}N^{-\frac{3}{5}})$ in the noiseless and noisy computation settings, respectively, where $N$ is the number of worker nodes, of which at most $S$ are slow servers (stragglers). Finally, we evaluate the proposed scheme on inference tasks for various machine learning models and demonstrate that the proposed framework outperforms the state of the art in terms of accuracy and rate of convergence.
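
As a rough illustration of the encode, compute, decode pipeline described above, the following minimal sketch may help. It is not the optimal encoder and decoder derived for the second-order Sobolev space: plain piecewise-linear interpolation (`np.interp`) is swapped in for both, the data are scalars, and the names `coded_compute`, `alphas`, and `betas` are hypothetical choices made only for this example.

```python
import numpy as np

def coded_compute(f, data, num_workers, stragglers):
    """Toy coded-computing round (illustrative assumption, not the paper's scheme).

    f           : vectorized function each worker applies to its coded input
    data        : 1-D array of K scalar data points
    num_workers : N, number of worker nodes
    stragglers  : indices of workers whose results never arrive
    Returns approximate values of f(data[k]) for every k.
    """
    data = np.asarray(data, dtype=float)
    K = len(data)
    alphas = np.linspace(0.0, 1.0, K)            # one anchor point per data point
    betas = np.linspace(0.0, 1.0, num_workers)   # one evaluation point per worker

    # Encoder: a piecewise-linear curve through (alpha_k, x_k), sampled at the betas.
    # Each worker therefore receives a combination of neighbouring data points.
    coded_inputs = np.interp(betas, alphas, data)

    # Each worker applies f to its coded input.
    worker_outputs = f(coded_inputs)

    # Results from stragglers are simply missing.
    alive = np.setdiff1d(np.arange(num_workers), stragglers)

    # Decoder: interpolate through the surviving results and read off at the alphas.
    return np.interp(alphas, betas[alive], worker_outputs[alive])

# Usage: K = 10 data points, N = 400 workers, S = 5 stragglers.
data = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 10))
estimates = coded_compute(np.tanh, data, num_workers=400,
                          stragglers=[3, 50, 120, 250, 390])
mse = np.mean((estimates - np.tanh(data)) ** 2)
print(f"mean squared error: {mse:.2e}")
```

In this toy setting the decoder only has to bridge small gaps left by the missing workers, so the estimation error shrinks as $N$ grows relative to $S$. The specific rates quoted in the abstract apply to the proposed encoder and decoder, not to this simplified stand-in.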
