A Transverse-Read-assisted Valid-Bit Collection to Accelerate Stochastic Computing MAC for Energy-Efficient in-RTM DNNs

Combining racetrack memory (RM) with stochastic computing (SC) is an attractive way to build an ultra-low-power neuron architecture. However, this combination has long been questioned for one fatal weakness: the narrow bit-view of the RM-MTJ structure, a.k.a. the shift-and-access pattern, cannot physically match the high throughput required by directly stored stochastic sequences. Fortunately, the recently developed Transverse Read (TR) gives RM a wider segment-view by detecting the resistance of the domain walls between a pair of MTJs on a single nanowire, so RM can access the sequences faster without any substantial domain shift. To exploit TR for power-efficient SC-DNNs, in this work we propose a segment-based compression that uses one-cycle TR to read only the kernel segments of stochastic sequences while removing a large number of redundant segments to achieve ultra-high storage density. In the decompression stage, low-discrepancy stochastic sequences are quickly reassembled from the kernel segments by a select-and-output loop rather than slowly regenerated by costly stochastic number generators (SNGs). Since TR provides an ideal in-memory acceleration of one-counting, counter-free SC-MACs are designed and deployed near the RMs to form a power-efficient neuron architecture in which the binary results of TR are passed directly to activation without sluggish Accumulative Parallel Counters (APCs). The results show that, under the TR-aided RM model, the power efficiency, speed, and stochastic accuracy of Seed-based Fast Stochastic Computing significantly improve DNN performance. Computation is 2.88x faster on LeNet-5 and 4.40x faster on VGG-19 than with the CORUSCANT model. Overall, deploying TR-assisted SC-MACs near the RM eliminates the need for slow APCs and speeds up access to stochastic sequences.
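The one-counting idea above can be illustrated with a minimal Python sketch. It is not the authors' design: it assumes that a single TR returns the number of 1s in a fixed-width segment of a nanowire, and that SC multiplication uses the standard unipolar encoding (bitwise AND of two streams). The segment width `SEGMENT_BITS` and all function names are hypothetical, chosen only for illustration.

```python
SEGMENT_BITS = 4  # assumed TR window; the abstract does not state the segment length


def transverse_read(nanowire, start, width=SEGMENT_BITS):
    """Model one TR as the count of 1s in a segment, obtained without shifting."""
    return sum(nanowire[start:start + width])


def sc_multiply(a_bits, b_bits):
    """Unipolar SC multiplication: bitwise AND of two equal-length streams."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def count_ones_by_tr(nanowire, width=SEGMENT_BITS):
    """Sum per-segment TR counts instead of streaming bits into a parallel counter."""
    return sum(
        transverse_read(nanowire, start, width)
        for start in range(0, len(nanowire), width)
    )


if __name__ == "__main__":
    # Two hand-picked, uncorrelated 8-bit streams that each encode 0.5.
    a = [1, 1, 0, 0, 1, 1, 0, 0]
    b = [1, 0, 1, 0, 1, 0, 1, 0]
    product = sc_multiply(a, b)
    ones = count_ones_by_tr(product)
    print(ones / len(product))  # 0.25, i.e. 0.5 * 0.5
```

In this toy model an 8-bit product stream needs two segment reads rather than eight shift-and-access steps, which is the kind of saving the segment-view is meant to provide; the real gain depends on device parameters that the abstract does not give.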
