General-purpose computing follows the von Neumann architecture, in which data movement between memory and processing elements dictates processor performance. The evolving compute-in-memory (CiM) paradigm tackles this issue by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of the hierarchy affect the figures of merit (FoMs) of SRAM, such as power, performance, area, and yield. Without a rapid mechanism for assessing how changes at different hierarchy levels affect global FoMs, it is difficult to evaluate innovative SRAM designs accurately. This paper presents an automation tool that optimizes the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool's structure allows easy comparison across different array topologies and design strategies, leading to energy-efficient implementations. Our study comprehensively compares more than 6900 distinct design implementation strategies for the EPFL combinational benchmark circuits on the energy-recycling resonant compute-in-memory (rCiM) architecture, designed in TSMC 28 nm technology. Given a combinational circuit, the tool aims to generate an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. Averaged across all benchmarks, the six-topology implementation reduces energy consumption by 80.9% compared with the baseline single-macro topology, by exploiting the parallel processing capability of rCiM caches ranging from 4 KB to 192 KB.
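The constrained selection described in the abstract can be illustrated with a minimal sketch: among candidate implementation strategies, keep those that satisfy the memory and latency limits and return the one with the lowest energy. This is not the tool's actual algorithm or interface. The `Strategy` record, the `pick_strategy` function, and every value in `CANDIDATES` are hypothetical placeholders chosen for illustration, not measurements from the study.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Strategy:
    """Hypothetical record for one candidate implementation strategy."""
    name: str
    memory_kb: int      # memory the strategy requires, in KB
    latency_ns: float   # estimated latency, in ns
    energy_pj: float    # estimated energy, in pJ


def pick_strategy(
    candidates: Iterable[Strategy],
    max_memory_kb: int,
    max_latency_ns: float,
) -> Optional[Strategy]:
    """Return the lowest-energy candidate that meets both constraints.

    Returns None when no candidate is feasible.
    """
    feasible = [
        s for s in candidates
        if s.memory_kb <= max_memory_kb and s.latency_ns <= max_latency_ns
    ]
    return min(feasible, key=lambda s: s.energy_pj, default=None)


# Placeholder values for illustration only; not data from the paper.
CANDIDATES = [
    Strategy("option-a", memory_kb=4, latency_ns=40.0, energy_pj=100.0),
    Strategy("option-b", memory_kb=8, latency_ns=22.0, energy_pj=60.0),
    Strategy("option-c", memory_kb=192, latency_ns=8.0, energy_pj=25.0),
]

best = pick_strategy(CANDIDATES, max_memory_kb=16, max_latency_ns=30.0)
```

An exhaustive filter-then-minimize pass is used here only because it is the simplest correct form of the idea; a search over thousands of strategies per benchmark may well rely on pruning or other techniques that the abstract does not describe.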