Processing-in-cache (PiC) and processing-in-memory (PiM) architectures, especially those based on bit-line computing, are a promising way to mitigate data movement bottlenecks within the memory hierarchy. Previous studies have explored integrating compute units within individual memory levels, but the complexity and potential overheads of these designs have often limited their capabilities. This paper introduces Concurrent Hierarchical In-Memory Processing (CHIME), a novel PiC/PiM architecture that strategically incorporates heterogeneous compute units across multiple levels of the memory hierarchy. The design targets efficient execution of diverse, domain-specific workloads by placing each computation at the level closest to its data where doing so optimizes performance, energy consumption, data movement cost, and area. CHIME employs STT-RAM because of its advantages for PiC/PiM computing: high density, low leakage, and better resilience to data corruption when multiple word lines are activated. We demonstrate that CHIME enhances concurrency and improves compute unit utilization at each level of the memory hierarchy. We also present strategies for exploring the design space and for grouping and placing compute units across the memory hierarchy. Experiments show that, compared to state-of-the-art bit-line computing approaches, CHIME achieves a speedup of 57.95% and energy savings of 78.23% across various domain-specific workloads, while reducing the overheads associated with single-level compute designs.
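For readers unfamiliar with bit-line computing, the following is a minimal functional sketch of the general idea: activating two word lines at once lets every bit-line sense both cells together, so a bitwise result for the whole row is produced in one access. It is a generic illustration, not CHIME's circuit or compute-unit design; the function name `bitline_op` and the threshold-style sensing model are assumptions made for this example.

```python
def bitline_op(row_a, row_b, op):
    """Functionally model a two-row bit-line operation (hypothetical helper).

    row_a, row_b: lists of 0/1 bits stored in the two activated rows.
    op: "AND", "OR", or "XOR", standing in for different sensing thresholds.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same width")

    result = []
    for a, b in zip(row_a, row_b):
        # Number of activated cells on this bit-line that store a 1.
        ones = a + b
        if op == "AND":
            result.append(1 if ones == 2 else 0)
        elif op == "OR":
            result.append(1 if ones >= 1 else 0)
        elif op == "XOR":
            result.append(1 if ones == 1 else 0)
        else:
            raise ValueError(f"unsupported op: {op}")
    return result


if __name__ == "__main__":
    a = [1, 0, 1, 0]
    b = [1, 1, 0, 0]
    for op in ("AND", "OR", "XOR"):
        print(op, bitline_op(a, b, op))
```

The loop processes one bit-line at a time only for clarity; in hardware all bit-lines of the activated rows are sensed in parallel, which is where the data movement savings come from.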