With the rapid development of deep learning models and hardware support for dense computing, the characteristics of deep learning workloads have changed significantly, from a few hot spots in compute-intensive operations to a broad range of operations scattered across the models. Accelerating only a few compute-intensive operations with expert-tuned primitive implementations does not fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is achieving high-performance tensor compilation, which requires both generating expert-level code for the dense compute-intensive operations and applying compiler optimizations at the scope of the DNN computation graph, across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that takes a hybrid approach, combining compiler optimization techniques with expert-tuned kernels to generate high-performance code for deep neural network graphs. oneDNN Graph Compiler addresses optimization challenges specific to the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layouts, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over an existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel Xeon Scalable Processors.