The success of machine learning (ML) has been closely linked to the availability of large amounts of data, typically collected from heterogeneous sources and processed on vast networks of computing devices (also called {\em workers}). Beyond accuracy, the use of ML in critical domains such as healthcare and autonomous driving calls for robustness against {\em data poisoning} and some {\em faulty workers}. The problem of {\em Byzantine ML} formalizes these robustness issues by considering a distributed ML environment in which workers (each storing a portion of the global dataset) can deviate arbitrarily from the prescribed algorithm. Although the problem has attracted a lot of attention from a theoretical point of view, its practical importance for addressing realistic faults (where the behavior of any worker is locally constrained) remains unclear. It has been argued that the seemingly weaker threat model, where only workers' local datasets get poisoned, is more reasonable. We prove that, while tolerating a wider range of faulty behaviors, Byzantine ML yields solutions that are, in a precise sense, optimal even under the weaker data poisoning threat model. Then, we study a generic data poisoning model wherein some workers have {\em fully-poisonous local data}, i.e., their datasets are entirely corruptible, and the remaining workers have {\em partially-poisonous local data}, i.e., only a fraction of their local datasets is corruptible. We prove that Byzantine-robust schemes yield optimal solutions against both these forms of data poisoning, and that the former is more harmful when workers have {\em heterogeneous} local data.