Spatial dataflow architectures like the Cerebras Wafer-Scale Engine deliver exceptional performance in AI and scientific computing by distributing scratchpad memory across hundreds of thousands of processing elements (PEs). Yet programming these architectures remains difficult: with no shared memory, all data movement must be configured explicitly, and asynchronous task management adds substantial complexity. We present SpaDA, a programming language that offers precise control over data placement, dataflow patterns, and asynchronous operations while abstracting low-level architectural details. We design and implement a compiler that targets Cerebras CSL through multi-level lowering and unique optimization passes. SpaDA serves both as a high-level programming interface and as an intermediate representation for domain-specific languages (DSLs), which we demonstrate with the GT4Py stencil DSL. SpaDA enables concise expression of operations with complex parallel patterns -- including pipelined collective operations, multi-dimensional stencils, and dense linear algebra -- in 14.09x fewer lines of code than CSL, while achieving over 260 TFlop/s across 730,000 PEs on a single device.