During the past decade, Deep Learning (DL) algorithms, programming systems and hardware have converged with their High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems remains stagnant, relying on highly optimized yet platform-specific and inflexible vendor libraries. Such libraries provide close-to-peak performance on the specific platforms, kernels and kernel shapes to which vendors have dedicated optimization effort, but they underperform in the remaining use cases, yielding non-portable codes with performance glass-jaws. This work introduces a framework to develop efficient, portable DL and HPC kernels for modern CPU architectures. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), a compact, versatile set of 2D-tensor operators, and 2) expressing the logical loops around TPPs in a high-level, declarative fashion, while the exact instantiation (ordering, tiling, parallelization) is determined via simple knobs. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
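The two-step decomposition described in the abstract can be illustrated with a minimal pure-Python sketch of a blocked matrix multiplication. This is not the framework's actual API: the names `gemm_block_tpp`, `blocked_loops` and `gemm`, the `order` string and the `blocks` dictionary are all hypothetical, chosen only to show the idea. The 2D-block primitive stands in for a TPP, and the loop ordering and tiling are selected by two simple knobs. Parallelization is omitted, and block sizes are assumed to divide the matrix extents evenly.

```python
from itertools import product


def gemm_block_tpp(A, B, C, i0, j0, k0, bm, bn, bk):
    """Hypothetical stand-in for a 2D-tensor primitive.

    Accumulates the product of a (bm x bk) block of A and a (bk x bn)
    block of B into the (bm x bn) block of C starting at (i0, j0).
    """
    for i in range(i0, i0 + bm):
        for j in range(j0, j0 + bn):
            acc = 0.0
            for k in range(k0, k0 + bk):
                acc += A[i][k] * B[k][j]
            C[i][j] += acc


def blocked_loops(extents, blocks, order):
    """Instantiate the logical loops from knobs (all names are assumptions).

    extents: total size of each logical loop, e.g. {"m": 4, "n": 4, "k": 6}
    blocks:  tile size of each logical loop (must divide the extent)
    order:   loop names from outermost to innermost, e.g. "mnk"
    Yields one dict of block origins per iteration of the loop nest.
    """
    for name in order:
        assert extents[name] % blocks[name] == 0, "block must divide extent"
    ranges = [range(0, extents[name], blocks[name]) for name in order]
    for origin in product(*ranges):
        yield dict(zip(order, origin))


def gemm(A, B, M, N, K, blocks, order="mnk"):
    """Compute C = A @ B; the loop body never changes, only the knobs do."""
    C = [[0.0] * N for _ in range(M)]
    extents = {"m": M, "n": N, "k": K}
    for o in blocked_loops(extents, blocks, order):
        gemm_block_tpp(A, B, C, o["m"], o["n"], o["k"],
                       blocks["m"], blocks["n"], blocks["k"])
    return C
```

Calling `gemm(A, B, M, N, K, {"m": 2, "n": 2, "k": 3}, "mnk")` and `gemm(A, B, M, N, K, {"m": 4, "n": 1, "k": 2}, "knm")` traverses the iteration space in different orders with different tiles, yet the kernel body is identical in both cases. In this sketch the knobs affect only how the loop nest is instantiated, which is the separation of concerns the abstract describes.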