This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) that targets and is optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation fuses the operations in each layer of the MLP, which maximizes data reuse within the general register file and the shared local memory and thereby minimizes slow global memory accesses. We show with a simple roofline model that this significantly increases the arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by a factor of up to 30 and the CUDA PyTorch version on Nvidia's H100 GPU by a factor of up to 19. The code can be found at https://github.com/intel/tiny-dpcpp-nn.
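The roofline argument above can be illustrated with a minimal sketch. It assumes a deliberately simplified traffic model: an unfused MLP reads and writes every layer's activations through global memory, while a fused MLP reads the input once, writes the output once, and keeps intermediates on chip. The function names `attainable_flops` and `mlp_intensity`, the layer shape, and the hardware numbers are hypothetical placeholders chosen for illustration; they are not taken from the paper or from any device specification.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, intensity):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity)


def mlp_intensity(batch, width, n_layers, bytes_per_elem, fused):
    """Arithmetic intensity (FLOPs per byte of global memory traffic).

    Assumes square layers of size width x width and ignores biases and
    activation functions (a simplification for illustration only).
    """
    # One multiply and one add per weight per sample, for every layer.
    flops = 2 * batch * width * width * n_layers
    # Weights are read once in both variants.
    weight_bytes = n_layers * width * width * bytes_per_elem
    # Size of one activation matrix (batch x width).
    act_bytes = batch * width * bytes_per_elem
    if fused:
        # Input read once, output written once; intermediates stay on chip.
        traffic = weight_bytes + 2 * act_bytes
    else:
        # Every layer reads its input from and writes its output to global memory.
        traffic = weight_bytes + 2 * n_layers * act_bytes
    return flops / traffic


if __name__ == "__main__":
    # Hypothetical hardware numbers, NOT the specifications of any real GPU.
    peak = 100e12        # FLOP/s
    bandwidth = 1e12     # bytes/s
    for fused in (False, True):
        ai = mlp_intensity(batch=2**16, width=64, n_layers=4,
                           bytes_per_elem=2, fused=fused)
        bound = attainable_flops(peak, bandwidth, ai)
        print(f"fused={fused}: intensity={ai:.1f} FLOP/byte, "
              f"bound={bound / 1e12:.1f} TFLOP/s")
```

Under this model, fusion raises the arithmetic intensity whenever the network has more than one layer, which moves a memory-bound workload closer to the compute roof. How large the gain is on real hardware depends on details the sketch leaves out, such as cache behavior and how much of the intermediate data actually fits on chip.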