The ubiquity of distributed machine learning (ML) in sensitive public domain applications calls for algorithms that protect data privacy, while being robust to faults and adversarial behaviors. Although privacy and robustness have been extensively studied independently in distributed ML, their synthesis remains poorly understood. We present the first tight analysis of the error incurred by any algorithm ensuring robustness against a fraction of adversarial machines, as well as differential privacy (DP) for honest machines' data against any other curious entity. Our analysis exhibits a fundamental trade-off between privacy, robustness, and utility. Surprisingly, we show that the cost of this trade-off is marginal compared to that of the classical privacy-utility trade-off. To prove our lower bound, we consider the case of mean estimation, subject to distributed DP and robustness constraints, and devise reductions to centralized estimation of one-way marginals. We prove our matching upper bound by presenting a new distributed ML algorithm using a high-dimensional robust aggregation rule. The latter amortizes the dependence on the dimension in the error (caused by adversarial workers and DP), while being agnostic to the statistical properties of the data.
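To make the setting concrete, the following is a minimal Python sketch of mean estimation under both constraints. It is an illustration under simplifying assumptions, not the algorithm analyzed in this paper: each honest machine holds a single vector, clips it, and adds Gaussian noise locally (the noise scale `sigma` is left as a free parameter rather than calibrated to a specific DP budget), and the server aggregates with a coordinate-wise trimmed mean, a simple stand-in for the high-dimensional robust aggregation rule mentioned above. All function and parameter names are hypothetical.

```python
import numpy as np


def clip_l2(x, clip):
    """Scale x so that its L2 norm is at most `clip` (bounds the sensitivity)."""
    norm = np.linalg.norm(x)
    if norm == 0.0 or norm <= clip:
        return x
    return x * (clip / norm)


def privatize(x, clip, sigma, rng):
    """Clip a local vector and add Gaussian noise before it leaves the machine.

    Assumption: `sigma` is supplied by the caller; this sketch does not derive
    it from an (epsilon, delta) privacy budget.
    """
    return clip_l2(x, clip) + rng.normal(0.0, sigma, size=x.shape)


def trimmed_mean(reports, f):
    """Coordinate-wise trimmed mean: per coordinate, drop the f smallest and
    the f largest values, then average what remains."""
    reports = np.sort(np.asarray(reports, dtype=float), axis=0)
    n = reports.shape[0]
    if 2 * f >= n:
        raise ValueError("need n > 2f reports to trim f from each side")
    return reports[f:n - f].mean(axis=0)


def robust_private_mean(honest_data, adversarial_reports, clip, sigma, seed=0):
    """Simulate one round: honest machines send noisy clipped vectors,
    adversarial machines send arbitrary vectors, the server trims and averages."""
    rng = np.random.default_rng(seed)
    noisy = [privatize(np.asarray(x, dtype=float), clip, sigma, rng)
             for x in honest_data]
    adversarial = [np.asarray(a, dtype=float) for a in adversarial_reports]
    reports = np.vstack(noisy + adversarial)
    return trimmed_mean(reports, f=len(adversarial))
```

In this sketch, increasing `sigma` strengthens privacy but inflates the error of the estimate, while the trimming step limits how far the adversarial reports can move the output. The coordinate-wise rule is chosen only for brevity; it does not reproduce the dimension dependence achieved by the aggregation rule of the paper.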