The convergence of SGD-based distributed training algorithms is tied to the data distribution across workers. Standard partitioning techniques aim for equal-sized partitions whose per-class populations are proportional to those of the full dataset. However, partitions with the same overall size, or even the same number of samples per class, may still be non-IID in feature space. In heterogeneous computing environments, where devices differ in computing capability, equal-sized partitions can also cause the straggler problem in distributed SGD. We develop a framework for distributed SGD in heterogeneous environments based on a novel data partitioning algorithm that uses submodular optimization. The algorithm explicitly accounts for resource heterogeneity across workers while achieving similar class-level feature distributions and maintaining class balance. Built on this algorithm, our framework accelerates existing state-of-the-art (SOTA) distributed training algorithms by up to 32%.
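To make the partitioning goal concrete, the following is a minimal sketch of capacity-proportional, class-stratified partitioning. It is not the submodular algorithm described above: it only illustrates two of the stated constraints (partition sizes that track worker speed, and preserved class balance) and does nothing to match class-level feature distributions. The function name `partition_by_capacity` and the `speeds` input (relative worker throughput) are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def partition_by_capacity(labels, speeds):
    """Split sample indices across workers in proportion to `speeds`.

    The split is done separately for each class, so every worker keeps
    roughly the same class mix as the full dataset.
    """
    total_speed = sum(speeds)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts = [[] for _ in speeds]
    for idxs in by_class.values():
        n = len(idxs)
        # Ideal (fractional) share of this class for each worker.
        quotas = [n * s / total_speed for s in speeds]
        counts = [int(q) for q in quotas]
        # Largest-remainder rounding: hand leftover samples to the
        # workers whose quotas were truncated the most.
        leftover = n - sum(counts)
        order = sorted(
            range(len(speeds)),
            key=lambda w: quotas[w] - counts[w],
            reverse=True,
        )
        for w in order[:leftover]:
            counts[w] += 1
        # Assign contiguous slices of this class to each worker.
        start = 0
        for w, c in enumerate(counts):
            parts[w].extend(idxs[start:start + c])
            start += c
    return parts


# Assumed toy setup: 100 samples, two classes, the middle worker is 2x faster.
labels = [0] * 60 + [1] * 40
speeds = [1.0, 2.0, 1.0]
parts = partition_by_capacity(labels, speeds)
print([len(p) for p in parts])  # prints [25, 50, 25]
```

In this toy setup the faster worker receives twice as many samples, so all workers would finish an epoch at about the same time under the assumed speeds, while each partition keeps the 60/40 class ratio. Balancing feature distributions within each class would require an additional selection step, which this sketch does not attempt.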