In recent years, the growing demand to process large graphs and sparse datasets has led to increased research efforts to develop hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization with up to thousands of cores, industry adoption of these proposals has remained slow. To help close this gap, we identify a set of questions and considerations that current research has not examined in depth. Starting from a tile-based architecture, we put forward a Distributed Chiplet-based Reconfigurable Architecture (DCRA) for irregular applications that carefully considers the fabrication constraints that made prior work either hard or costly to implement, or too rigid to be applied. We identify and study pre-silicon, package-time, and compile-time configurations that help optimize DCRA for different deployments and target metrics. To enable this, we propose a practical path for manufacturing chip packages by composing variable numbers of DCRA and memory dies, with a software-configurable torus network to connect them. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of DCRA as a compute node for scale-out sparse data processing. Finally, we present our findings and discuss how DCRA's framework for design exploration can help guide architects to build scalable and cost-efficient systems for irregular applications.