DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models

Stochastic computing (SC) offers hardware simplicity but suffers from low throughput, while high-throughput Digital Computing-in-Memory (DCIM) is bottlenecked by the costly adder logic needed for matrix-vector multiplication (MVM). To address this trade-off, this paper introduces a digital stochastic CIM (DS-CIM) architecture that achieves both high accuracy and high efficiency. By modifying the data representation, we implement signed multiply-accumulate (MAC) operations in a compact, unsigned OR-based circuit. Throughput is increased by replicating this low-cost circuit 64 times with only a 1x area increase. Our core strategy, a shared Pseudo Random Number Generator (PRNG) with 2D partitioning, enables single-cycle, mutually exclusive activation that eliminates OR-gate collisions. We also resolve the 1s saturation issue through stochastic process analysis and data remapping, which significantly improves accuracy and resilience to input sparsity. Our high-accuracy DS-CIM1 variant achieves 94.45% accuracy for INT8 ResNet18 on CIFAR-10 with a root-mean-squared error (RMSE) of just 0.74%. Our high-efficiency DS-CIM2 variant attains an energy efficiency of 3566.1 TOPS/W and an area efficiency of 363.7 TOPS/mm^2 while maintaining a low RMSE of 3.81%. DS-CIM's capability on larger models is further demonstrated through experiments with INT8 ResNet50 on ImageNet and the FP8 LLaMA-7B model.
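
The collision problem that mutually exclusive activation is meant to remove can be illustrated with a small software model. The sketch below is not the DS-CIM circuit: the function names, the operand width `N`, the way the sample space is split into one region per product, and the exhaustive enumeration that stands in for a full-period PRNG are all assumptions made for illustration. It compares OR-accumulation when every product shares the same sample against OR-accumulation when each product is active only in its own region of the sample space.

```python
from itertools import product


def naive_or_count(acts, weights, n):
    """OR-accumulate products that all share one (r1, r2) sample.

    Each product bit is (r1 < a) AND (r2 < w). Several bits can be 1 on
    the same sample, and the OR gate merges them into a single 1.
    """
    ones = 0
    for r1, r2 in product(range(n), repeat=2):
        bits = [(r1 < a) and (r2 < w) for a, w in zip(acts, weights)]
        ones += any(bits)
    return ones


def partitioned_or_count(acts, weights, n):
    """OR-accumulate with one disjoint sample region per product.

    Assumption for this sketch: the shared sample is split into a region
    index plus a local (r1, r2) pair, and product i may fire only when
    the sample falls in region i.
    """
    m = len(acts)
    ones = 0
    for sample in range(m * n * n):  # stands in for a full-period PRNG
        region, local = divmod(sample, n * n)
        r1, r2 = divmod(local, n)
        bits = [
            (i == region) and (r1 < a) and (r2 < w)
            for i, (a, w) in enumerate(zip(acts, weights))
        ]
        assert sum(bits) <= 1  # at most one active input: no OR collision
        ones += any(bits)
    return ones


N = 8                      # hypothetical 3-bit unsigned operands
acts = [3, 5, 2, 7]        # illustrative activations
weights = [4, 1, 6, 2]     # illustrative weights

exact = sum(a * w for a, w in zip(acts, weights))
naive = naive_or_count(acts, weights, N)
partitioned = partitioned_or_count(acts, weights, N)
print(exact, naive, partitioned)
```

With exhaustive sampling, the partitioned count equals the exact dot product, while the shared-sample OR undercounts because overlapping 1s collapse into one. This simple partition spends one region's worth of samples per product, so it demonstrates only the collision-free property and makes no claim about the cycle count or area of the architecture described in the abstract.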
