We consider the problem of evaluating arbitrary multivariate polynomials over a massive dataset containing multiple inputs, on a distributed computing system with a master node and multiple worker nodes. Generalized Lagrange Coded Computing (GLCC) codes are proposed to simultaneously provide resiliency against stragglers that do not return computation results in time, security against adversarial workers that deliberately modify results for their benefit, and information-theoretic privacy of the dataset amidst possible collusion of workers. GLCC codes are constructed by first partitioning the dataset into multiple groups, then encoding the dataset using carefully designed interpolating polynomials, and sharing multiple encoded data points with each worker, such that the interference among computation results across groups can be eliminated at the master. In particular, GLCC codes include the state-of-the-art Lagrange Coded Computing (LCC) codes as a special case, and exhibit a more flexible tradeoff between communication and computation overheads in optimizing system efficiency. Furthermore, we apply GLCC codes to the distributed training of machine learning models, and demonstrate that they achieve speedups of $2.5\text{--}3.9\times$ over LCC codes in training time, in experiments training image classifiers across different datasets, model architectures, and straggler patterns.
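To make the grouping and encoding flow concrete, the following minimal Python sketch illustrates the Lagrange-interpolation encoding and decoding pipeline described above on a toy scalar dataset. All names (`lagrange_basis`, `encode_group`), the evaluation points, and the toy polynomial are illustrative assumptions rather than the paper's notation; the sketch works over the reals for readability (the actual construction operates over a finite field) and omits the random padding for privacy, the redundancy for adversary tolerance, and the cross-group interference-cancellation mechanism of the full GLCC construction.

```python
def lagrange_basis(z, points, j):
    """Evaluate the j-th Lagrange basis polynomial through `points` at z."""
    num, den = 1.0, 1.0
    for m, p in enumerate(points):
        if m != j:
            num *= (z - p)
            den *= (points[j] - p)
    return num / den

def encode_group(group_data, betas, alpha):
    """Encode one group as u_g(alpha), where u_g interpolates the group's
    inputs at the points betas."""
    return sum(x * lagrange_basis(alpha, betas, j) for j, x in enumerate(group_data))

# Toy setting: K = 4 scalar inputs, G = 2 groups, N = 7 workers,
# and a degree-2 target polynomial f, so f(u_g(z)) has degree 2.
f = lambda x: 3 * x**2 + x                     # polynomial to evaluate on every input
dataset = [1.0, 2.0, 3.0, 4.0]                 # X_1, ..., X_4
groups = [dataset[0:2], dataset[2:4]]          # partition into G = 2 groups
betas = [0.0, 1.0]                             # interpolation points shared by all groups
alphas = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # distinct evaluation points, one per worker

# Each worker receives one encoded point per group and returns f of each.
worker_results = [[f(encode_group(g, betas, a)) for g in groups] for a in alphas]

# Master: f(u_g(z)) has degree deg(f) * (len(betas) - 1) = 2, so any 3
# non-straggler responses suffice to interpolate it and recover
# f(X_k) = f(u_g(beta_k)) for every input in every group.
recovered = []
for g_idx in range(len(groups)):
    ys = [worker_results[i][g_idx] for i in range(3)]   # first 3 responses
    pts = alphas[:3]
    for beta in betas:
        recovered.append(sum(ys[j] * lagrange_basis(beta, pts, j) for j in range(3)))

print(recovered)                 # ~ [f(1), f(2), f(3), f(4)]
print([f(x) for x in dataset])   # ground truth for comparison
```

In this toy run the master needs only 3 of the 7 workers per group, illustrating the straggler resiliency; the full GLCC construction additionally trades the number of encoded points per worker against the decoding threshold, which is the communication-versus-computation tradeoff referred to in the abstract.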