Count2Multiply: Reliable In-memory High-Radix Counting

Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, which exploit substantial compute parallelism by processing data directly in the memory elements. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. DRAM-based solutions, in contrast, have successfully demonstrated multiplication using additions but remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM, chosen for its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories (RTM), whose shifting properties are a natural fit for Count2Multiply and which also offer high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x, respectively.
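
To make the idea of multiplying by counting concrete, the following is a minimal software sketch, not the method as implemented in the paper. It assumes plain radix-2, bit-sliced counters (one Python integer per bit plane, one bit position per column) rather than the high-radix counters the abstract describes, and it omits error correction entirely. The names `masked_increment` and `int_binary_matvec` are hypothetical and chosen only for illustration. Each input integer `x[i]` triggers `x[i]` masked increments, where the mask is row `i` of the binary matrix, and every increment uses only bitwise XOR and AND across all columns at once.

```python
def masked_increment(planes, mask):
    """Add 1 to every counter whose bit in `mask` is set.

    planes[k] holds bit k of all counters: bit j of planes[k] is bit k
    of counter j. Counters wrap around modulo 2**len(planes).
    """
    carry = mask
    for k in range(len(planes)):
        # Half adder applied to all columns in parallel:
        # sum = plane XOR carry, next carry = plane AND carry.
        planes[k], carry = planes[k] ^ carry, planes[k] & carry
    return planes


def int_binary_matvec(x, B, n_bits):
    """Compute y[j] = sum_i x[i] * B[i][j] for a binary matrix B by counting.

    Assumes non-negative integers in x, entries of B in {0, 1}, and
    n_bits large enough that no counter overflows.
    """
    n_cols = len(B[0])
    planes = [0] * n_bits
    for xi, row in zip(x, B):
        # Pack the binary row into a bit mask, one bit per column.
        mask = 0
        for j, b in enumerate(row):
            mask |= b << j
        # Multiplying by xi becomes xi masked increments.
        for _ in range(xi):
            masked_increment(planes, mask)
    # Read each counter back out of the bit planes.
    return [
        sum(((planes[k] >> j) & 1) << k for k in range(n_bits))
        for j in range(n_cols)
    ]


if __name__ == "__main__":
    x = [3, 1, 2]
    B = [[1, 0, 1],
         [0, 1, 1],
         [1, 1, 0]]
    print(int_binary_matvec(x, B, n_bits=4))
```

In this sketch the number of increments grows linearly with the magnitude of each input value, which is one way to see why a higher counting radix could reduce the work per input. How the paper realizes that in memory is not something this example attempts to reproduce.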
