In recent years, the growing demand to process large graphs and sparse datasets has driven research into hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization across up to thousands of cores, adoption of these proposals by industry has remained slow. To help close this gap, we identify a set of questions and considerations that current research has not examined in depth. Starting from a tile-based architecture, we put forward a Distributed Chiplet-based Reconfigurable Architecture (DCRA) for irregular applications that carefully considers the fabrication constraints that made prior work either hard or costly to implement or too rigid to be applied. We identify and study pre-silicon, package-time, and compile-time configurations that help optimize DCRA for different deployments and target metrics. To enable this, we propose a practical path for manufacturing chip packages by composing variable numbers of DCRA and memory dies, with a software-configurable torus network to connect them. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of DCRA as a compute node for scale-out sparse data processing. Finally, we present our findings and discuss how DCRA, together with our framework for design exploration, can help guide architects to build scalable and cost-efficient systems for irregular applications.