Fault-Tolerant Masked Matrix Accumulation using Bulk Bitwise In-Memory Engines

Big data processing has exposed the limits of compute-centric hardware acceleration, which is constrained by the memory-to-processor bandwidth bottleneck. This has driven a shift towards memory-centric architectures that exploit substantial compute parallelism by computing directly in the memory elements. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. In emerging data-intensive applications such as advanced machine learning and bioinformatics, matrix multiplication is a key primitive, and memristor crossbars suffer from limited write endurance and expensive write operations. DRAM-based solutions, in contrast, have successfully demonstrated multiplication using additions but remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. Count2Multiply is also designed with fault tolerance in mind: it leverages traditional, scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM, chosen for its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories (RTMs), whose shifting properties are a natural fit for Count2Multiply and which likewise offer high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x, respectively.
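
To make the idea of integer-binary multiplication by masked accumulation with bitwise logic more concrete, the following is a minimal software sketch, not the Count2Multiply design itself. It assumes a simplified radix-2 bit-plane layout (one Python integer per bit-plane, standing in for a memory row) rather than the high-radix counters described above, and it omits error correction entirely. The names masked_add and int_binary_matvec are hypothetical and chosen only for illustration.

```python
def masked_add(planes, value, mask):
    """Add `value` to every column selected by `mask` using only bitwise row ops.

    Assumption: planes[k] is a Python int used as a bit-vector (a stand-in for
    one memory row); bit j of planes[k] holds bit k of column j's accumulator.
    """
    carry = 0
    for k in range(len(planes)):
        # Bit k of `value` is broadcast to all masked columns.
        addend = mask if (value >> k) & 1 else 0
        a = planes[k]
        # Column-parallel full adder: sum is XOR, carry is the majority function.
        planes[k] = a ^ addend ^ carry
        carry = (a & addend) | (a & carry) | (addend & carry)
    return planes  # a carry out of the top plane is dropped (overflow)


def int_binary_matvec(x, B, width=16):
    """Compute y[j] = sum_i x[i] * B[i][j] for non-negative ints x and binary B."""
    n_cols = len(B[0])
    planes = [0] * width  # all accumulators start at zero
    for xi, row in zip(x, B):
        # Each binary row of B acts as the mask selecting which columns accumulate xi.
        mask = 0
        for j, bit in enumerate(row):
            if bit:
                mask |= 1 << j
        masked_add(planes, xi, mask)
    # Read the accumulators back out of the bit-planes, column by column.
    return [
        sum(((planes[k] >> j) & 1) << k for k in range(width))
        for j in range(n_cols)
    ]


x = [3, 5, 2]
B = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]
print(int_binary_matvec(x, B))  # [5, 7, 8]
```

Every column is updated by the same handful of AND, OR, and XOR operations per bit-plane, which is the property that lets bulk bitwise in-memory engines process an entire row of accumulators at once. The sketch assumes that all results fit in width bits.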
