Efficient Reduce and AllReduce communication collectives are a cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE), an architecture that has been shown to achieve unprecedented performance for both machine learning workloads and other computational problems such as FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate its predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.