Deep neural networks (DNNs) have great potential to solve many real-world problems, but they usually require substantial computation and memory. Deploying a large DNN model on a single resource-limited device with small memory capacity is difficult. Distributed computing is a common approach to reducing single-node memory consumption and accelerating the inference of DNN models. In this paper, we explore "within-layer model parallelism", which distributes the inference of each layer across multiple nodes. In this way, the memory requirement is spread over many nodes, making it possible to use several edge devices to infer a large DNN model. Because of the dependencies within each layer, data communication between nodes during this parallel inference can become a bottleneck when the communication bandwidth is limited. We propose a framework to train DNN models for Distributed Inference with Sparse Communications (DISCO). We convert the problem of selecting which subset of data to transmit between nodes into a model optimization problem, and derive models with both computation and communication reduction when each layer is inferred on multiple nodes. We show the benefit of the DISCO framework on a variety of computer vision (CV) tasks such as image classification, object detection, semantic segmentation, and image super-resolution. The corresponding models include important DNN building blocks such as convolutions and transformers. For example, each layer of a ResNet-50 model can be distributively inferred across two nodes with five times less data communication, almost half the overall computation, and half the memory requirement for a single node, while achieving accuracy comparable to the original ResNet-50 model. This also results in a 4.7 times overall inference speedup.
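To make the idea of within-layer parallelism with sparse communication concrete, the following is a minimal sketch and not the DISCO implementation. It assumes a single fully connected layer split across two hypothetical nodes A and B, with weight blocks named `W_aa`, `W_ab`, `W_ba`, and `W_bb` (all names are illustrative). Each node holds half of the input features and produces half of the output features. A remote feature has to be transmitted only if the cross-node block has a nonzero weight in its column, so sparsity in the cross-node weights translates directly into fewer transmitted values. How such sparsity would be obtained during training is outside the scope of this sketch.

```python
def matvec(W, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def needed_columns(W_cross):
    """Indices of remote features that a cross-node block actually uses.

    A column that is entirely zero means the matching remote feature
    never has to be transmitted.
    """
    n_cols = len(W_cross[0])
    return [j for j in range(n_cols) if any(row[j] != 0 for row in W_cross)]


def node_forward(W_local, W_cross, x_local, received):
    """Compute one node's slice of the layer output.

    `received` maps remote feature index -> value, and contains only the
    features that were transmitted (assumption: missing ones have zero weight).
    """
    out = matvec(W_local, x_local)
    for i, row in enumerate(W_cross):
        out[i] += sum(row[j] * v for j, v in received.items())
    return out


def distributed_layer(W_aa, W_ab, W_ba, W_bb, x_a, x_b):
    """Simulate both nodes in one process and count transmitted values."""
    send_b_to_a = {j: x_b[j] for j in needed_columns(W_ab)}
    send_a_to_b = {j: x_a[j] for j in needed_columns(W_ba)}
    y_a = node_forward(W_aa, W_ab, x_a, send_b_to_a)
    y_b = node_forward(W_bb, W_ba, x_b, send_a_to_b)
    return y_a + y_b, len(send_b_to_a) + len(send_a_to_b)


# Illustrative values only: node A needs a single feature from node B,
# and node B needs nothing from node A.
W_aa, W_ab = [[1, 2], [0, 1]], [[0, 3], [0, 0]]
W_ba, W_bb = [[0, 0], [0, 0]], [[1, 0], [2, 1]]
y, sent = distributed_layer(W_aa, W_ab, W_ba, W_bb, [1, 2], [3, 4])
print(y, sent)  # [17, 2, 3, 10] 1
```

In this toy case the result matches the full dense layer, while only one value crosses the node boundary instead of the four that a dense split would exchange. Each node also stores only its own two weight blocks, which is the source of the per-node memory reduction in this simplified picture.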