The Cerebras Wafer-Scale Engine (WSE) delivers performance at an unprecedented scale of over 900,000 compute units, all connected via a single-wafer on-chip interconnect. Initially designed for AI, the WSE architecture is also well suited to High Performance Computing (HPC). However, its distributed asynchronous programming model diverges significantly from the simple sequential or bulk-synchronous programs that one would typically derive from a given mathematical description of a problem, so porting existing code to the WSE requires a bespoke re-implementation. The absence of WSE support in compiler frameworks such as MLIR has meant that there was little hope of automating this process. Stencils are ubiquitous in HPC, and in this paper we explore the hypothesis that domain-specific information about stencils can be leveraged by the compiler to automatically target the WSE without requiring application-level code changes. We present a compiler pipeline that transforms stencil-based kernels into highly optimized CSL code for the WSE, bridging the semantic gap between the mathematical representation of the problem and the WSE's asynchronous execution model. Across five benchmarks spanning three HPC programming technologies, run on both the Cerebras WSE2 and WSE3, our approach delivers performance comparable to, if not slightly better than, manually optimized code. Furthermore, without requiring any application-level code changes, our approach on the WSE3 runs around 14 times faster than 128 Nvidia A100 GPUs and 20 times faster than 128 nodes of a CPU-based Cray-EX supercomputer.