Machine Learning (ML) models execute several parallel computations, such as Generalized Matrix Multiplication, Convolution, and Dropout. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles is usually higher than the number of execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units, so the tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows independent tiles of dependent kernels to execute concurrently. We also present a compiler that generates diverse fine-grained synchronization policies based on the dependencies between kernels. Our experiments show that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes.
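The under-utilization in the final wave can be illustrated with a small calculation. The following is a minimal sketch, not part of cuSync: the function name `wave_stats` and the example numbers are hypothetical, and it assumes a simplified model in which each execution unit runs exactly one tile at a time.

```python
import math
from typing import Tuple


def wave_stats(num_tiles: int, num_units: int) -> Tuple[int, float]:
    """Return (number of waves, fraction of units busy in the final wave).

    Simplifying assumption: each execution unit runs one tile at a time,
    so a full wave processes exactly `num_units` tiles.
    """
    if num_tiles <= 0 or num_units <= 0:
        raise ValueError("num_tiles and num_units must be positive")
    # Tiles are scheduled in waves of at most `num_units` tiles each.
    waves = math.ceil(num_tiles / num_units)
    # Tiles left over for the final wave after all full waves have run.
    tiles_in_final_wave = num_tiles - (waves - 1) * num_units
    return waves, tiles_in_final_wave / num_units


if __name__ == "__main__":
    # Hypothetical example: 120 tiles on a GPU with 80 execution units.
    waves, final_utilization = wave_stats(120, 80)
    print(waves, final_utilization)
```

Under this model, 120 tiles on 80 execution units need two waves, and the second wave keeps only half of the units busy. Tile-level synchronization targets exactly this idle capacity, since independent tiles of a dependent kernel could occupy the otherwise idle units.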