We consider gradient coding in the presence of an adversary that controls so-called malicious workers, which try to corrupt the computation. Previous works propose using MDS codes to treat the responses of malicious workers as errors and to correct them through the error-correction properties of the code. This comes at the expense of increased replication, i.e., the number of workers that compute each partial gradient. In this work, we propose a way to reduce the replication from $2s+1$ to $s+1$ in the presence of $s$ malicious workers. Our method detects erroneous inputs from the malicious workers and thereby transforms them into erasures. This comes at the expense of $s$ additional local computations at the main node and additional rounds of light communication between the main node and the workers. We define a general framework and give fundamental limits for fractional repetition data allocations. Our scheme is optimal in terms of replication and local computation, and it incurs a communication cost that is, asymptotically in the size of the dataset, a multiplicative factor away from the derived bound. We furthermore show how additional redundancy can be exploited to reduce the number of local computations and the communication cost or, alternatively, to tolerate straggling workers.
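To make the replication gap concrete, the following toy sketch contrasts error correction by majority vote, which needs $2s+1$ replicas of a partial gradient, with error detection, for which $s+1$ replicas suffice because at least one replica is honest. It is not the scheme of this work: the function names are hypothetical, replicas are assumed to be compared exactly (deterministic arithmetic at the workers), and the fallback in which the main node recomputes the disputed partial gradient is a simplified stand-in for the local computations and communication rounds mentioned above.

```python
from collections import Counter


def detect_disagreement(responses):
    """Return True if the replicas of one partial gradient do not all agree."""
    return len(set(responses)) > 1


def majority_decode(responses):
    """Error correction by majority vote.

    Only guaranteed to be correct when honest replicas outnumber
    malicious ones, i.e. with at least 2s+1 replicas for s liars.
    """
    value, _count = Counter(responses).most_common(1)[0]
    return value


def detect_then_resolve(responses, recompute_locally):
    """Error detection with s+1 replicas (toy version).

    If all s+1 replicas agree, the value is correct, since at least one
    replica is honest. Otherwise the disagreement is treated like an
    erasure and the main node resolves it itself (assumed fallback).
    """
    if detect_disagreement(responses):
        return recompute_locally()
    return responses[0]


# Illustrative values for s = 2 malicious workers (gradients as tuples).
honest = (1.0, -2.0)
forged = (9.0, 9.0)

three_replicas = [honest, forged, forged]                 # s + 1 replicas
five_replicas = [honest, honest, honest, forged, forged]  # 2s + 1 replicas
```

With only $s+1$ replicas, `majority_decode(three_replicas)` returns the forged value, whereas `detect_then_resolve(three_replicas, lambda: honest)` notices the disagreement and falls back to the main node. With $2s+1$ replicas, `majority_decode(five_replicas)` recovers the honest value without any fallback.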