Massive Data-Centric Parallelism in the Chiplet Era

Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory access patterns and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads. Of these, Dalorex demonstrated high scalability by keeping the entire dataset on-chip, scattered across processing units (PUs), and executing each task at the PU where its data is local. However, the communication needs of this approach do not scale to systems beyond 10k cores, and two questions remain unanswered: how to handle larger datasets and how to achieve a cost-efficient design for production. To address these challenges, we propose a throughput-aware scalable chiplet architecture for distributed execution (Tascade), a multi-node system design that we evaluate with up to 256 distributed chips, for a total of 1 million PUs. We introduce a programming model that scales to this level through proxy regions and selective cascading, which reduce communication needs and improve load balancing. In addition, package-time reconfiguration of our large-scale chip design enables chip products optimized for different target metrics, such as time-to-solution, energy, or cost. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First Search with RMAT-26 across a million PUs, the largest in the literature, reaches 3021 GTEPS.
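As a rough illustration of the data-local execution idea described above (the dataset scattered across PUs, and each task executed at the PU that owns the data), the following Python sketch simulates a breadth-first search in which every distance update runs on the owning PU. It is a simplified model under stated assumptions, not the Tascade or Dalorex implementation: the names `owner` and `data_local_bfs`, the modulo vertex placement, and the round-robin scheduling are all hypothetical, and proxy regions and selective cascading are not modeled.

```python
from collections import deque


def owner(vertex, num_pus):
    """Hypothetical placement rule: vertex v is stored on PU (v mod num_pus)."""
    return vertex % num_pus


def data_local_bfs(adjacency, root, num_pus):
    """Simulate owner-computes BFS and count tasks that cross PU boundaries.

    adjacency maps each vertex to a list of neighbors. In this simulation it is
    a single dict, but a PU only ever reads the entries of vertices it owns.
    """
    # Each PU keeps distances only for the vertices it owns.
    local_dist = [dict() for _ in range(num_pus)]
    # One task queue per PU; a task is (vertex, candidate_distance).
    queues = [deque() for _ in range(num_pus)]
    queues[owner(root, num_pus)].append((root, 0))
    remote_messages = 0

    while any(queues):
        for pu in range(num_pus):
            # Process only the tasks that were queued when this PU's turn began.
            for _ in range(len(queues[pu])):
                vertex, dist = queues[pu].popleft()
                known = local_dist[pu].get(vertex)
                if known is not None and known <= dist:
                    continue  # an equal or shorter distance is already stored
                local_dist[pu][vertex] = dist
                # Spawn one task per neighbor, routed to the PU that owns it.
                for neighbor in adjacency[vertex]:
                    dest = owner(neighbor, num_pus)
                    if dest != pu:
                        remote_messages += 1
                    queues[dest].append((neighbor, dist + 1))

    # Gather the per-PU results into one dict for inspection.
    distances = {}
    for per_pu in local_dist:
        distances.update(per_pu)
    return distances, remote_messages
```

Calling `data_local_bfs(graph, 0, num_pus)` with different values of `num_pus` returns the same distances but a different count of cross-PU tasks, which is the communication cost that a scheme of this kind has to keep in check as the PU count grows.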
