We consider gradient coding in the presence of an adversary that controls so-called malicious workers, which try to corrupt the computations. Previous works propose using MDS codes to treat the responses from malicious workers as errors and to correct them using the error-correction properties of the code. This comes at the expense of increased replication, i.e., the number of workers that compute each partial gradient. In this work, we propose a way to reduce the replication from $2s+1$ to $s+1$ in the presence of $s$ malicious workers. Our method detects erroneous inputs from the malicious workers and thereby transforms them into erasures. This comes at the expense of $s$ additional local computations at the main node and additional rounds of light communication between the main node and the workers. We define a general framework and give fundamental limits for fractional repetition data allocations. Our scheme is optimal in terms of replication and local computation, and it incurs a communication cost that is, asymptotically in the size of the dataset, a multiplicative factor away from the derived bound. We furthermore show how additional redundancy can be exploited to reduce the number of local computations and the communication cost or, alternatively, to tolerate straggling workers.
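To illustrate why a replication of $s+1$ can suffice for detection, the following is a minimal sketch under assumptions that are not taken from the paper: a fractional repetition allocation in which each data block is assigned to a group of $s+1$ workers, a least-squares loss, and a naive fallback in which the main node recomputes the partial gradient of any group whose responses disagree. The names `partial_gradient` and `aggregate_with_detection` are hypothetical. The sketch does not reproduce the scheme's bound of $s$ local computations or its communication rounds; it only shows that, with at most $s$ malicious workers, $s+1$ identical responses must include an honest one.

```python
import numpy as np

def partial_gradient(w, X, y):
    # Gradient of 0.5 * ||X w - y||^2 on one data block
    # (least squares is an assumption made for illustration).
    return X.T @ (X @ w - y)

def aggregate_with_detection(w, blocks, responses):
    """Sum partial gradients, treating disagreeing groups as erasures.

    blocks:    list of (X, y) data blocks, one per group.
    responses: responses[g] holds the s+1 vectors returned by group g.
    Returns the full gradient and the number of local recomputations.
    """
    total = np.zeros_like(w)
    local_computations = 0
    for (X, y), group in zip(blocks, responses):
        if all(np.array_equal(r, group[0]) for r in group[1:]):
            # At most s of the s+1 workers are malicious, so identical
            # responses include at least one honest (correct) value.
            total += group[0]
        else:
            # Disagreement detected: discard the group's responses and
            # recompute locally (naive fallback, assumed for this sketch).
            total += partial_gradient(w, X, y)
            local_computations += 1
    return total, local_computations

rng = np.random.default_rng(0)
s, num_groups = 2, 3
w = rng.normal(size=4)
blocks = [(rng.normal(size=(5, 4)), rng.normal(size=5)) for _ in range(num_groups)]
honest = [partial_gradient(w, X, y) for X, y in blocks]
responses = [[h.copy() for _ in range(s + 1)] for h in honest]
responses[1][0] = responses[1][0] + 1.0  # one malicious worker corrupts its response
grad, n_local = aggregate_with_detection(w, blocks, responses)
```

In this toy run, only the group containing the corrupted response triggers a local recomputation, and the aggregated gradient matches the sum of the honest partial gradients. Exact equality is used for the comparison because honest workers are assumed to run the same deterministic computation on the same data.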