Many modern machine learning tasks rely on large-scale distributed clusters as a critical component of the training pipeline. However, Byzantine behavior by worker nodes can derail training and compromise the quality of inference. Such behavior can stem from unintentional system malfunctions or from orchestrated attacks; in either case, some nodes may return arbitrary results to the parameter server (PS) that coordinates the training. Recent work considers a wide range of attack models and explores robust aggregation and/or computational redundancy to correct the distorted gradients. In this work, we consider attack models that range from strong, in which $q$ omniscient adversaries with full knowledge of the defense protocol can change from iteration to iteration, to weak, in which $q$ randomly chosen adversaries with limited collusion abilities change only once every few iterations. Our algorithms rely on redundant task assignments coupled with detection of adversarial behavior. We also show that our method converges to the optimal point under assumptions and settings commonly considered in the literature. For strong attacks, we demonstrate a reduction in the fraction of distorted gradients of between 16% and 99% compared to the prior state of the art. Our top-1 classification accuracy results on the CIFAR-10 data set demonstrate a 25% advantage in accuracy (averaged over strong and weak scenarios) under the most sophisticated attacks compared to state-of-the-art methods.
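To make the idea of redundant task assignment coupled with detection more concrete, the following is a minimal, generic sketch and not the algorithm of this work. It assumes a cyclic assignment in which each gradient task is replicated on `r` workers, a per-task majority vote at the PS, and a rule that flags any worker whose result disagrees with the majority. The function names `assign_tasks` and `aggregate_and_detect`, the toy gradients, and the choice of Byzantine worker are all hypothetical.

```python
from collections import Counter


def assign_tasks(num_workers, num_tasks, r):
    """Cyclic assignment (an assumption): task t goes to workers t, t+1, ..., t+r-1 mod num_workers."""
    return {t: [(t + j) % num_workers for j in range(r)] for t in range(num_tasks)}


def aggregate_and_detect(assignment, returned):
    """Majority-vote each task's replicas and flag workers that disagree with the majority.

    returned[(worker, task)] is the hashable gradient that the worker reported for the task.
    """
    voted, suspects = {}, set()
    for task, workers in assignment.items():
        votes = Counter(returned[(w, task)] for w in workers)
        winner, _ = votes.most_common(1)[0]
        voted[task] = winner
        suspects.update(w for w in workers if returned[(w, task)] != winner)
    return voted, suspects


# Toy setup: 5 workers, 5 tasks, each task replicated on 3 workers.
num_workers, num_tasks, r = 5, 5, 3
assignment = assign_tasks(num_workers, num_tasks, r)

# Hypothetical "true" gradients, stored as tuples so they can be counted.
true_grad = {t: (float(t), float(t) + 0.5) for t in range(num_tasks)}

# Worker 1 is Byzantine and returns a distorted value for every task it holds.
byzantine = {1}
returned = {}
for task, workers in assignment.items():
    for w in workers:
        returned[(w, task)] = (999.0, 999.0) if w in byzantine else true_grad[task]

voted, suspects = aggregate_and_detect(assignment, returned)
print(voted == true_grad, suspects)
```

In this sketch the vote recovers the correct gradient only when fewer than half of a task's replicas are distorted; an omniscient adversary that concentrates its $q$ nodes on the same tasks can defeat plain majority voting, which is the kind of scenario the attack models above are meant to capture.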