Modern ML applications increasingly rely on complex deep learning models and large datasets, and the amount of computation needed to train the largest models has grown exponentially. To scale computation and data, these models are therefore trained in a distributed manner on clusters of nodes, and the workers' updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We define the quality of workers as reconstruction ratios $\in (0,1]$, and formulate aggregation as a Maximum Likelihood Estimation procedure using Beta densities. We show that the regularized form of the log-likelihood with respect to the subspace can be approximately solved using an iterative least squares solver, and provide convergence guarantees using recent convex optimization landscape results. Our empirical findings demonstrate that our approach significantly enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We evaluate our method in a distributed setup with a parameter server, and show simultaneous improvements in communication efficiency and accuracy across various tasks. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator
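To make the notion of a reconstruction ratio concrete, the following is a minimal sketch and not the released implementation. It assumes that the ratio of a worker gradient $g$ with respect to an orthonormal basis $Y$ is $\|Y^\top g\|^2 / \|g\|^2$, a definition the abstract does not spell out. The subspace here comes from a plain truncated SVD of the column-normalized gradients, a stand-in for the Beta-likelihood objective and the iterative least squares solver. The names `reconstruction_ratios` and `top_subspace` and the toy data are hypothetical.

```python
import numpy as np

def top_subspace(G, m):
    """Orthonormal basis (n x m) from a truncated SVD of column-normalized G.

    Assumption: plain SVD stands in for the paper's iterative solver.
    """
    G_unit = G / np.linalg.norm(G, axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(G_unit, full_matrices=False)
    return U[:, :m]

def reconstruction_ratios(G, Y):
    """Fraction of each column's squared norm captured by span(Y).

    G: (n, p) matrix whose columns are worker gradients.
    Y: (n, m) matrix with orthonormal columns.
    Assumed definition: ||Y^T g||^2 / ||g||^2 for each column g.
    """
    projected = Y.T @ G
    return np.sum(projected**2, axis=0) / np.sum(G**2, axis=0)

# Toy data: 7 workers share a common direction, 1 worker sends noise.
rng = np.random.default_rng(0)
n, p, m = 50, 8, 1
direction = rng.normal(size=(n, 1))
honest = direction + 0.1 * rng.normal(size=(n, p - 1))
byzantine = 5.0 * rng.normal(size=(n, 1))
G = np.hstack([honest, byzantine])

Y = top_subspace(G, m)
ratios = reconstruction_ratios(G, Y)
print(np.round(ratios, 3))
```

Under this assumed definition, each ratio is a squared cosine between a gradient and its projection onto the subspace, which is why it is bounded by 1 and can be read as a per-worker quality score.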