Generalized linear models (GLMs) are a widely used family of machine learning models in real-world applications. As data sizes grow, efficient distributed training of these models becomes essential. However, existing distributed training systems incur high communication cost and often use large batch sizes to balance computation against communication, which hurts convergence. We therefore argue for an efficient distributed GLM training system that strives for linear scalability while keeping the batch size reasonably small. As a first step, we propose P4SGD, a distributed heterogeneous training system that efficiently trains GLMs through model parallelism between distributed FPGAs and through forward-communication-backward pipeline parallelism within an FPGA. Moreover, we propose a lightweight, latency-centric in-switch aggregation protocol, powered by a programmable switch, to minimize the latency of the AllReduce operation between distributed FPGAs. To our knowledge, P4SGD is the first solution that achieves almost linear scalability between distributed accelerators through model parallelism. We implement P4SGD on eight Xilinx U280 FPGAs and a Tofino P4 switch. Our experiments show that P4SGD converges up to 6.5X faster than its state-of-the-art GPU counterpart.
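To make the model-parallel scheme described in the abstract concrete, the following is a minimal software sketch, not the P4SGD implementation. It assumes logistic regression as the GLM, simulates the workers in a single process with NumPy, and stands in a plain sum for the AllReduce that the paper performs in the switch. All function and variable names (`serial_sgd`, `model_parallel_sgd`, `workers`, and so on) are hypothetical. Each simulated worker owns a slice of the features and the matching slice of the model, and the three phases per mini-batch mirror the forward, communication, and backward stages named above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def serial_sgd(X, y, lr, batch, epochs):
    """Single-worker mini-batch SGD for logistic regression (reference)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for s in range(0, X.shape[0], batch):
            Xb, yb = X[s:s + batch], y[s:s + batch]
            err = sigmoid(Xb @ w) - yb
            w -= lr * Xb.T @ err / len(yb)
    return w

def model_parallel_sgd(X, y, lr, batch, epochs, workers):
    """Model-parallel SGD simulated in one process (illustrative only)."""
    # Assumption: features are split into contiguous slices, one per worker.
    slices = np.array_split(np.arange(X.shape[1]), workers)
    w_parts = [np.zeros(len(idx)) for idx in slices]
    for _ in range(epochs):
        for s in range(0, X.shape[0], batch):
            Xb, yb = X[s:s + batch], y[s:s + batch]
            # Forward: each worker computes partial dot products on its slice.
            partial = [Xb[:, idx] @ w for idx, w in zip(slices, w_parts)]
            # Communication: AllReduce (sum) of the partial results.
            # The reduced vector has one entry per sample in the mini-batch.
            z = np.sum(partial, axis=0)
            err = sigmoid(z) - yb
            # Backward: each worker updates only its own model slice.
            for idx, w in zip(slices, w_parts):
                w -= lr * Xb[:, idx].T @ err / len(yb)
    return np.concatenate(w_parts)

# Small synthetic problem to compare the two variants.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

w_serial = serial_sgd(X, y, lr=0.1, batch=8, epochs=3)
w_mp = model_parallel_sgd(X, y, lr=0.1, batch=8, epochs=3, workers=4)
print(np.allclose(w_serial, w_mp))
```

In this sketch the payload exchanged per mini-batch is a vector whose length equals the batch size, regardless of the model dimension. With a small batch the message is tiny, so the cost of each step is dominated by AllReduce latency rather than bandwidth, which is consistent with the abstract's emphasis on a latency-centric aggregation protocol.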