As investment in AI-focused accelerators grows and their deployment in supercomputing facilities expands, understanding whether these architectures can efficiently support traditional scientific kernels is critical for the future of High-Performance Computing. We investigate the mapping of 2D 5-point stencil computations onto the Tenstorrent Wormhole, a RISC-V AI dataflow accelerator. We develop two heterogeneous implementations: Axpy, which decomposes the stencil into element-wise submatrix operations, and MatMul, which reformulates it as a matrix multiplication. While the CPU baseline remains 3x faster end-to-end, profiling reveals that the isolated Wormhole kernel is competitive with CPU execution, with the gap driven by PCIe transfers, device initialization, and host-side preprocessing. Despite slower runtime, Axpy achieves lower energy consumption than the CPU baseline for large inputs. Through detailed profiling and theoretical analysis, we identify key architectural and software limitations of the current platform and outline concrete hardware and software directions that could make AI accelerators competitive for HPC workloads.
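To make the two formulations concrete, the following minimal sketch contrasts a direct 5-point stencil with an Axpy-style and a MatMul-style variant in plain Python. It is illustrative only and is not the Wormhole implementation. The uniform weight `w`, the zero-valued out-of-grid neighbours, the omission of a centre term, and all function names are assumptions made for this example. The MatMul variant uses tridiagonal shift matrices, which is one possible reformulation and may differ from the mapping used in this work.

```python
def stencil_reference(u, w=0.25):
    """Direct 5-point stencil: weighted sum of the four neighbours.
    Assumption: neighbours outside the grid count as zero."""
    n, m = len(u), len(u[0])

    def at(i, j):
        return u[i][j] if 0 <= i < n and 0 <= j < m else 0.0

    return [[w * (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1))
             for j in range(m)] for i in range(n)]


def stencil_axpy(u, w=0.25):
    """Axpy-style: accumulate four shifted submatrices, out += w * shifted(u)."""
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        # Only visit cells whose shifted neighbour lies inside the grid.
        for i in range(max(0, -di), min(n, n - di)):
            for j in range(max(0, -dj), min(m, m - dj)):
                out[i][j] += w * u[i + di][j + dj]
    return out


def shift_matrix(k):
    """k x k matrix with ones on the two off-diagonals."""
    return [[1.0 if abs(r - c) == 1 else 0.0 for c in range(k)]
            for r in range(k)]


def matmul(a, b):
    """Naive dense matrix product, used here only for illustration."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def stencil_matmul(u, w=0.25):
    """MatMul-style: out = w * (T_n @ u + u @ T_m) with shift matrices T."""
    n, m = len(u), len(u[0])
    vertical = matmul(shift_matrix(n), u)    # up + down neighbours
    horizontal = matmul(u, shift_matrix(m))  # left + right neighbours
    return [[w * (vertical[i][j] + horizontal[i][j]) for j in range(m)]
            for i in range(n)]


grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
assert stencil_axpy(grid) == stencil_reference(grid)
assert stencil_matmul(grid) == stencil_reference(grid)
```

All three functions compute the same result on the small test grid. They differ only in how the work is expressed: as per-cell neighbour reads, as element-wise scaled additions of shifted submatrices, or as dense matrix products.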