Markov chain Monte Carlo (MCMC) is a widely used sampling method in modern artificial intelligence and probabilistic computing systems. It requires repeated random number generation, which often dominates the latency of probabilistic model computation. We therefore propose a compute-in-memory (CIM) based MCMC design as a hardware acceleration solution. This work investigates SRAM bitcell stochasticity and proposes a novel ``pseudo-read'' operation, on which we build a block-wise circuit scheme for fast random number generation. Moreover, this work proposes a novel multi-stage exclusive-OR gate (MSXOR) design method to generate strictly uniformly distributed random numbers; the deviation of the output probability from a uniform distribution is suppressed below $10^{-5}$. This work also presents a novel in-memory copy circuit scheme that performs data copy inside a CIM sub-array, significantly reducing the use of R/W circuits and thereby saving power. Evaluated in a commercial 28-nm process development kit, this CIM-based MCMC design generates 4-bit to 32-bit samples with an energy efficiency of $0.53$~pJ/sample and a throughput of up to $166.7$M~samples/s. Compared to conventional processors, the overall energy efficiency improves by a factor of $5.41\times10^{11}$ to $2.33\times10^{12}$.
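The abstract does not describe the MSXOR circuit itself, so the following is only a generic sketch of why XOR-ing several biased bits pushes the result toward a uniform distribution. It assumes the input bits are independent and share the same probability of being 1, which is a simplification that may not hold for real SRAM bitcells. The function names and the example bias of 0.6 are hypothetical and are not taken from the paper; only the $10^{-5}$ target comes from the text above.

```python
def xor_chain_prob(p_one: float, n_bits: int) -> float:
    """P(output = 1) after XOR-ing n_bits independent bits, each with P(1) = p_one.

    Computed stage by stage: the running XOR flips exactly when the next bit is 1.
    """
    q = 0.0  # the XOR of zero bits is 0, so P(1) starts at 0
    for _ in range(n_bits):
        q = q * (1.0 - p_one) + (1.0 - q) * p_one
    return q


def xor_bias(p_one: float, n_bits: int) -> float:
    """Closed-form deviation of the XOR output from 0.5: 0.5 * |1 - 2p|^n."""
    return 0.5 * abs(1.0 - 2.0 * p_one) ** n_bits


def bits_needed(p_one: float, target: float = 1e-5) -> int:
    """Smallest number of XOR-ed bits whose deviation from 0.5 is below target."""
    if not 0.0 < p_one < 1.0:
        # A constant bit (p = 0 or p = 1) can never be debiased by XOR.
        raise ValueError("p_one must lie strictly between 0 and 1")
    n = 1
    while xor_bias(p_one, n) >= target:
        n += 1
    return n


if __name__ == "__main__":
    p = 0.6  # illustrative bias only, not a measured value
    n = bits_needed(p)
    print(f"bits needed for deviation < 1e-5 at p = {p}: {n}")
    print(f"resulting deviation: {xor_bias(p, n):.3e}")
```

Under these assumptions the deviation shrinks geometrically with the number of XOR-ed bits, so a strongly biased source needs only a handful of stages to approach the stated error bound. Correlated bits would weaken this guarantee.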