We tackle the problem of Byzantine errors in distributed gradient descent within the Byzantine-resilient gradient coding framework. Our proposed solution can recover the exact full gradient in the presence of $s$ malicious workers with a data replication factor of only $s+1$. It generalizes previous solutions to any data assignment scheme that has a regular replication over all data samples. The scheme detects malicious workers through additional interactive communication and a small number of local computations at the main node, leveraging group-wise comparisons between workers with a provably optimal grouping strategy. The scheme requires at most $s$ interactive rounds that incur a total communication cost logarithmic in the number of data samples.
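To make the detection idea concrete, the following is a toy sketch and not the protocol of the paper. It only illustrates why a disagreement between two replicas can be resolved with a number of interactive queries logarithmic in the number of data samples, plus a single local gradient computation at the main node. The assumptions are illustrative: per-sample gradients are integers, each worker answers partial-sum queries consistently with one fixed list of claimed values, and the two workers' total sums differ. All names (`make_worker`, `locate_disagreement`, `identify_liars`) are hypothetical. The optimal grouping strategy and the bound of at most $s$ rounds are not reproduced here.

```python
from typing import Callable, List, Tuple

# A worker is modeled as a function answering "what is your claimed
# sum of per-sample gradients over samples lo..hi-1?".
Worker = Callable[[int, int], int]


def make_worker(claimed: List[int]) -> Worker:
    """Hypothetical worker that answers consistently from a fixed claim list."""
    def partial_sum(lo: int, hi: int) -> int:
        return sum(claimed[lo:hi])
    return partial_sum


def locate_disagreement(a: Worker, b: Worker, n: int) -> Tuple[int, int]:
    """Binary-search for one sample index on which a and b disagree.

    Precondition (assumed): a(0, n) != b(0, n).
    Returns (index, number of interactive queries).
    """
    lo, hi = 0, n
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if a(lo, mid) != b(lo, mid):
            hi = mid   # the disagreement lies in the left half
        else:
            lo = mid   # left halves agree, so the right halves must differ
    return lo, queries


def identify_liars(a: Worker, b: Worker, n: int,
                   true_gradient: Callable[[int], int]) -> Tuple[int, int, List[str]]:
    """The main node computes one gradient locally to decide who lied."""
    idx, queries = locate_disagreement(a, b, n)
    truth = true_gradient(idx)  # single local computation at the main node
    liars = [name for name, w in (("a", a), ("b", b)) if w(idx, idx + 1) != truth]
    return idx, queries, liars


# Toy data: 8 samples, worker b corrupts the gradient of sample 5.
true_grads = [3, -1, 4, 1, -5, 9, 2, 6]
corrupted = list(true_grads)
corrupted[5] = 0

honest = make_worker(true_grads)
malicious = make_worker(corrupted)

idx, queries, liars = identify_liars(honest, malicious, len(true_grads),
                                     lambda i: true_grads[i])
print(idx, queries, liars)
```

In this toy run the search uses three queries for eight samples and flags only worker `b`. Because the two claims differ at the located sample, at least one of them must differ from the locally computed gradient, so each disagreement exposes at least one malicious worker.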