Distributed model training must contend with challenges such as the straggler effect and Byzantine attacks. When coordinating training across multiple computing nodes, ensuring timely and reliable gradient aggregation despite network and system malfunctions is essential. To tackle these issues, we propose \textit{dSTAR}, a lightweight and efficient approach for distributed stochastic gradient descent (SGD) that improves both robustness and convergence. \textit{dSTAR} selectively aggregates gradients by collecting updates from the first \(k\) workers to respond and filtering them according to their deviation from an ensemble median. This method not only mitigates the impact of stragglers but also hardens the model against Byzantine adversaries. We theoretically establish that \textit{dSTAR} is \((\alpha, f)\)-Byzantine resilient and achieves a linear convergence rate. Empirical evaluations across a range of attack scenarios show that \textit{dSTAR} consistently maintains high accuracy, whereas other Byzantine-resilient methods often suffer accuracy drops of 40--50\% under attack. These results highlight \textit{dSTAR} as a robust solution for training models in distributed environments prone to both straggler delays and Byzantine faults.
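To make the aggregation rule concrete, the following is a minimal sketch, assuming the server already holds the gradients from the first \(k\) workers to respond. The function name \texttt{dstar\_like\_aggregate}, the threshold \texttt{tau}, and the choice to average the surviving gradients are illustrative assumptions, not the paper's exact specification.

\begin{verbatim}
import numpy as np

def dstar_like_aggregate(gradients, tau=3.0):
    """Aggregate the first-k worker gradients with median-based filtering.

    gradients: list of 1-D np.ndarray, the k earliest worker updates.
    tau: assumed deviation threshold, in multiples of the median deviation.
    """
    G = np.stack(gradients)                    # shape (k, d)
    median = np.median(G, axis=0)              # coordinate-wise (ensemble) median
    dev = np.linalg.norm(G - median, axis=1)   # each worker's deviation from the median
    scale = np.median(dev) + 1e-12             # robust scale estimate
    keep = dev <= tau * scale                  # drop far-off (possibly Byzantine) updates
    if not keep.any():                         # fall back to the median if all are filtered
        return median
    return G[keep].mean(axis=0)                # average the surviving gradients

# Usage: five honest workers plus one adversarial update.
rng = np.random.default_rng(0)
honest = [rng.normal(0.0, 0.1, size=10) for _ in range(5)]
byzantine = [np.full(10, 100.0)]
agg = dstar_like_aggregate(honest + byzantine)
print(agg)
\end{verbatim}

In this sketch, the adversarial update lies far from the ensemble median, so it is filtered out and the aggregate is computed from the honest gradients alone.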