In this paper, we investigate Byzantine-robust training in distributed machine learning (ML) systems, focusing on improving both efficiency and practicality. As distributed ML systems become integral to complex ML tasks, resilience against Byzantine failures, in which workers may contribute incorrect updates due to malice or error, becomes critically important. Our first contribution is the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels at low computational cost. Additionally, we propose using a recently developed gradient estimation technique, based on a double-momentum strategy, in the Byzantine setting. We highlight its theoretical and practical advantages for Byzantine-robust training, especially in simplifying the tuning process and reducing reliance on numerous hyperparameters. The effectiveness of this technique is supported by theoretical insights within the stochastic convex optimization (SCO) framework.
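The abstract does not spell out the CTMA procedure, so the following is only a minimal sketch of one plausible reading of "centered trimmed meta-aggregation": run a baseline robust aggregator to obtain a center, keep the worker updates closest to that center, and average them. The function names, the `byzantine_fraction` parameter, the coordinate-wise median baseline, and the rounding rule for how many updates to keep are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def coordinate_median(updates):
    """Hypothetical baseline aggregator: coordinate-wise median of the updates."""
    return np.median(updates, axis=0)


def centered_trimmed_meta_aggregate(updates, byzantine_fraction,
                                    base_aggregator=coordinate_median):
    """Sketch of a centered, trimmed meta-aggregator (assumed form).

    updates:            array-like of shape (m, d), one row per worker.
    byzantine_fraction: assumed upper bound on the fraction of faulty workers.
    base_aggregator:    any callable mapping an (m, d) array to a (d,) vector.
    """
    updates = np.asarray(updates, dtype=float)
    if updates.ndim != 2:
        raise ValueError("updates must have shape (m, d)")
    if not 0.0 <= byzantine_fraction < 0.5:
        raise ValueError("byzantine_fraction must lie in [0, 0.5)")

    m = updates.shape[0]
    # Step 1: the baseline aggregator supplies a (possibly rough) center.
    center = base_aggregator(updates)
    # Step 2: rank the workers by Euclidean distance to that center.
    distances = np.linalg.norm(updates - center, axis=1)
    # Step 3: keep the closest updates; the rounding rule is an assumption.
    keep = m - int(np.floor(byzantine_fraction * m))
    closest = np.argsort(distances)[:keep]
    # Step 4: average the retained updates.
    return updates[closest].mean(axis=0)
```

Beyond the baseline aggregator's own cost, this sketch adds one distance computation per worker and a sort over the m workers, which is in keeping with the low computational overhead the abstract claims.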