Modern ML applications increasingly rely on complex deep learning models and large datasets, and the computation needed to train the largest models has grown exponentially. To scale computation and data, these models are inevitably trained in a distributed manner on clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We extend the current state-of-the-art aggregators and propose an optimization-based subspace estimator that models pairwise distances as quadratic functions, building on the recently introduced Flag Median problem. The estimator in our loss function favors the pairs that preserve the norm of the difference vector. We show theoretically that our approach enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We also evaluate our method on different tasks in a distributed setup with a parameter server architecture and show its communication efficiency while maintaining similar accuracy. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator
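To make the aggregation problem concrete, the following is a minimal sketch of why a plain average of worker updates is fragile under a single Byzantine worker. It is not the Flag Aggregator proposed here: it uses the classical coordinate-wise median as a stand-in robust baseline, and the function names (`mean_aggregate`, `median_aggregate`) and the toy gradient values are assumptions made purely for illustration.

```python
from statistics import median


def mean_aggregate(updates):
    """Coordinate-wise mean of worker updates (not Byzantine-resilient)."""
    n = len(updates)
    return [sum(coords) / n for coords in zip(*updates)]


def median_aggregate(updates):
    """Coordinate-wise median: a simple classical robust baseline,
    used here only for illustration (not the paper's method)."""
    return [median(coords) for coords in zip(*updates)]


# Hypothetical toy data: four honest workers report gradients near
# [1.0, -2.0]; one faulty worker sends an arbitrary vector.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
byzantine = [[1000.0, 1000.0]]
updates = honest + byzantine

mean_update = mean_aggregate(updates)
median_update = median_aggregate(updates)

print("mean:  ", mean_update)    # dragged far from the honest gradients
print("median:", median_update)  # stays with the honest majority
```

In this toy setting the mean is pulled far away from every honest gradient by the single faulty worker, while the coordinate-wise median stays with the honest majority. The subspace estimator described in the abstract targets the same failure mode through a different, optimization-based formulation.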