We consider the problem of evaluating arbitrary multivariate polynomials over a massive dataset containing multiple inputs, on a distributed computing system with a master node and multiple worker nodes. We propose Generalized Lagrange Coded Computing (GLCC) codes, which simultaneously provide resiliency against stragglers that do not return computation results in time, security against adversarial workers that deliberately modify results for their own benefit, and information-theoretic privacy of the dataset under possible collusion among workers. GLCC codes are constructed by first partitioning the dataset into multiple groups, then encoding the dataset using carefully designed interpolation polynomials, and finally sharing multiple encoded data points with each worker, such that cross-group interference in the computation results can be eliminated at the master. In particular, GLCC codes include the state-of-the-art Lagrange Coded Computing (LCC) codes as a special case, and offer a more flexible tradeoff between communication and computation overheads when optimizing system efficiency. Furthermore, we apply GLCC to distributed training of machine learning models, and demonstrate that GLCC codes achieve a speedup of up to $2.5\text{--}3.9\times$ over LCC codes in training time, across experiments for training image classifiers on different datasets, model architectures, and straggler patterns.
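To make the encode, compute, and decode pipeline concrete, the following is a minimal sketch of the LCC special case only, not of the grouped GLCC construction. It assumes scalar data over a prime field, one random mask for privacy, a toy polynomial $f(x) = x^2 + 3x$, and stragglers but no adversarial workers (error-correcting decoding is omitted). The field size, the evaluation points, and all function names are hypothetical choices made for illustration.

```python
import random

P = 2_147_483_647  # prime modulus 2^31 - 1 (assumed field size)


def lagrange_eval(xs, ys, z, p=P):
    """Evaluate at z the unique polynomial of degree < len(xs)
    passing through the points (xs[i], ys[i]), working modulo p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (z - xj) % p
                den = den * (xi - xj) % p
        # pow(den, -1, p) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


def f(x, p=P):
    """Toy degree-2 polynomial that each worker evaluates."""
    return (x * x + 3 * x) % p


def lcc_demo(data, n_workers=9, n_masks=1, deg_f=2, n_stragglers=2, seed=0):
    rng = random.Random(seed)
    k = len(data)
    # Interpolation points for the data and masks, and disjoint
    # evaluation points for the workers (assumed choices).
    betas = list(range(1, k + n_masks + 1))
    alphas = list(range(k + n_masks + 1, k + n_masks + 1 + n_workers))

    # Encoding polynomial u(z): u(beta_i) = X_i, then random masks.
    masks = [rng.randrange(P) for _ in range(n_masks)]
    values = [x % P for x in data] + masks
    shares = [lagrange_eval(betas, values, a) for a in alphas]

    # Each worker applies f to its own encoded share.
    results = [f(s) for s in shares]

    # f(u(z)) has degree deg_f * (k + n_masks - 1), so this many
    # returned results suffice to interpolate it.
    needed = deg_f * (k + n_masks - 1) + 1
    assert n_workers - n_stragglers >= needed, "too few workers"

    # Simulate stragglers: only a random subset of workers responds.
    alive = rng.sample(range(n_workers), n_workers - n_stragglers)[:needed]
    xs = [alphas[i] for i in alive]
    ys = [results[i] for i in alive]

    # The master interpolates f(u(z)) and reads off f(X_i) = f(u(beta_i)).
    return [lagrange_eval(xs, ys, b) for b in betas[:k]]


if __name__ == "__main__":
    print(lcc_demo([5, 7, 11]))  # [40, 70, 154]
```

In this sketch each worker holds a single encoded point. As the abstract describes, GLCC instead partitions the dataset into groups and gives each worker several encoded points, which is the source of its additional flexibility in trading communication against computation.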