Stencil Computations on Cerebras Wafer-Scale Engine

Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional high-performance computing (HPC) architectures such as GPUs, where they run into the "memory wall". At the same time, AI-oriented hardware such as the Cerebras Wafer-Scale Engine (WSE) offers massive core parallelism and high-bandwidth on-chip memory, though it is typically optimized for lower-precision workloads. This work investigates whether the two can be brought together by mapping stencil algorithms onto the Cerebras WSE-3. The study introduces CStencil, a novel framework that implements two-dimensional stencil computations on the WSE-3. To ensure a rigorous and fair performance evaluation, the research also adapts ConvStencil, a state-of-the-art GPU stencil solver, porting it from its original double-precision design to single precision for execution on an NVIDIA A100 GPU. Experimental results show that the WSE-3's distributed SRAM and mesh interconnect effectively eliminate the off-chip memory bottlenecks common in GPU implementations. CStencil achieves speedups of up to 342x over the adapted ConvStencil version. A roofline model analysis further confirms that CStencil saturates the available compute and memory resources, demonstrating that the WSE dataflow architecture can be successfully repurposed for traditional scientific algorithms. These findings highlight the potential of the WSE-3 to deliver hardware utilization levels unattainable on conventional systems, offering a promising path toward overcoming the memory limitations of current HPC architectures.
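
For readers unfamiliar with the terms, the following is a minimal Python sketch of what a two-dimensional stencil sweep looks like and why a roofline estimate tends to classify it as memory-bound. It is not the CStencil or ConvStencil implementation. The 5-point averaging stencil, the function names, the flop and byte counts per point, and the peak figures in the usage comment are all illustrative assumptions, not values taken from the study.

```python
def stencil_sweep(grid):
    """Apply one 5-point averaging stencil sweep to a 2D list of floats.

    Boundary cells are copied unchanged (an assumed boundary treatment).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so reads always see the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.2 * (
                grid[i][j]
                + grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1]
            )
    return out


def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Floating-point operations performed per byte moved to or from memory."""
    return flops_per_point / bytes_per_point


def roofline_bound(peak_flops, peak_bandwidth, intensity):
    """Attainable performance under the roofline model.

    It is the smaller of the compute ceiling and the memory ceiling.
    """
    return min(peak_flops, peak_bandwidth * intensity)


# Assumed counts for the 5-point stencil above in single precision:
# 4 additions + 1 multiplication = 5 flops per point, and with ideal caching
# one 4-byte read plus one 4-byte write = 8 bytes per point.
intensity = arithmetic_intensity(5, 8)

# Placeholder peaks in arbitrary units (NOT real hardware specifications).
attainable = roofline_bound(peak_flops=10.0, peak_bandwidth=1.0, intensity=intensity)
memory_bound = attainable < 10.0
```

With these placeholder numbers the memory ceiling, not the compute ceiling, limits the kernel. That is the situation the abstract describes for GPU implementations; a design that keeps the working set in on-chip memory raises the effective bandwidth term in the same formula.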
