We consider gradient coding in the presence of an adversary who controls so-called malicious workers that try to corrupt the computations. Previous works propose using MDS codes to treat the inputs of the malicious workers as errors and to correct them using the error-correction properties of the code. This comes at the expense of increased replication, i.e., the number of workers that compute each partial gradient. In this work, we reduce replication by proposing a method that detects the erroneous inputs from the malicious workers, thereby transforming them into erasures. For $s$ malicious workers, our solution can reduce the replication of each partial gradient from $2s+1$ to $s+1$, at the expense of only $s$ additional computations at the main node and additional rounds of light communication between the main node and the workers. We derive fundamental limits of the general framework for fractional repetition data allocation. Our scheme is optimal in terms of replication and local computation, but incurs a communication cost that is, asymptotically in the size of the dataset, a multiplicative factor away from the derived bound.
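The gap between $2s+1$ and $s+1$ can be illustrated with a toy sketch. It is not the scheme proposed in this work: it only contrasts correcting errors by majority vote with merely detecting them, for one partial gradient replicated across workers. The function names, the tuple representation of a gradient, and the assumption that honest workers return bit-identical values are all hypothetical choices made for illustration.

```python
from collections import Counter


def majority_decode(replies):
    """Correct errors by majority vote (illustrative helper, not from the paper).

    Returns the value sent by a strict majority of workers, or None if
    no value has a strict majority.
    """
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None


def detect_disagreement(replies):
    """Detect errors: True if the replicas do not all agree.

    With s + 1 replicas and at most s malicious workers, at least one
    reply is honest, so unanimous replies are necessarily correct.
    """
    return len(set(replies)) > 1


s = 2                      # number of malicious workers (assumed)
true_grad = (1.0, -2.0)    # assumed deterministic partial gradient
forged = (9.0, 9.0)        # colluding malicious workers send the same lie

# Correction needs 2s + 1 replicas: s + 1 honest replies outvote s liars.
replies_correct = [true_grad] * (s + 1) + [forged] * s
recovered = majority_decode(replies_correct)

# Detection needs only s + 1 replicas: one honest reply exposes the liars.
replies_detect = [true_grad] + [forged] * s
flagged = detect_disagreement(replies_detect)

# With only s + 1 replicas, majority vote is fooled by the colluders.
fooled = majority_decode(replies_detect)
```

In this sketch, a flagged disagreement only tells the main node that something is wrong. Deciding which reply is honest is what the additional main-node computations and communication rounds described above are for.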