Neural networks have become a cornerstone of machine learning. As these networks grow ever more complex, so does the hardware and software infrastructure needed to train and deploy them. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by expressing neural networks as operator graphs and examining the dimensions along which they can be parallelised, namely intra-operator and inter-operator parallelism. We answer the second question by collecting the implementation challenges of each type of parallelism, together with the problem of optimally partitioning the operator graph. We answer the last question by collecting how parallelism is applied in modern multi-billion parameter transformer networks, to the extent that this is possible given the limited information shared about these networks.