It is attractive to coordinate racetrack memory (RM) and stochastic computing (SC) to build an ultra-low-power neuron architecture. However, this combination has long been questioned over a fatal weakness: the narrow bit-level view of the RM-MTJ structure, i.e., its shift-and-access pattern, physically cannot match the high throughput of directly stored stochastic sequences. Fortunately, the recently developed transverse read (TR) gives RM a wider segment-level view by sensing the resistance of the domain walls between a pair of MTJs on a single nanowire, so RM can access the sequences quickly without any substantial domain shifting. To exploit TR for power-efficient SC-DNNs, we propose a segment-based compression scheme that uses one-cycle TR to read only the kernel segments of stochastic sequences while removing a large number of redundant segments, yielding ultra-high storage density. In the decompression stage, low-discrepancy stochastic sequences are quickly reassembled from the kernel segments by a select-and-output loop rather than slowly regenerated by costly stochastic number generators (SNGs). Since TR provides ideal in-memory acceleration of one-counting, counter-free SC-MACs are designed and deployed near the RM arrays to form a power-efficient neuron architecture in which the binary results of TR are activated directly, without sluggish accumulative parallel counters (APCs). The results show that, under the TR-aided RM model, the proposed Seed-based Fast Stochastic Computing significantly improves the power efficiency, speed, and stochastic accuracy of DNNs: computation is 2.88x faster on LeNet-5 and 4.40x faster on VGG-19 compared with CORUSCANT.
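To make the select-and-output decompression concrete, below is a minimal software sketch, not the paper's implementation: the segment width, segment contents, and function names are illustrative assumptions, and the actual decompression happens in the RM read path in hardware.

    # Minimal sketch of the select-and-output decompression loop.
    # Assumption: kernel segments are fixed-width bit groups retained
    # after segment-based compression; one TR access returns one
    # segment per cycle.

    SEG_WIDTH = 8  # assumed width of one TR segment read

    def decompress(kernel_segments, selector_indices):
        """Reassemble a stochastic sequence by emitting stored kernel
        segments in selector order, instead of regenerating every bit
        with a stochastic number generator (SNG)."""
        stream = []
        for idx in selector_indices:   # one select-and-output per cycle
            stream.extend(kernel_segments[idx])
        return stream

    # Example: two kernel segments approximating p = 0.5 with low discrepancy
    segments = [[1, 0, 1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1, 0, 1]]
    seq = decompress(segments, [0, 1, 0, 1])
    print(sum(seq) / len(seq))  # -> 0.5

Because every emitted bit comes from a stored kernel segment, the loop only performs table lookups and concatenation, which is why it can outrun per-bit SNG regeneration.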
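The counter-free SC-MAC can be sketched in the same spirit, assuming unipolar stochastic encoding where multiplication is a bitwise AND; transverse_read_count() is a hypothetical stand-in that models TR's in-memory one-counting as a software popcount.

    # Minimal sketch of a counter-free SC-MAC (unipolar encoding assumed).

    def transverse_read_count(segment):
        # In hardware, TR reports the number of '1' domains between two
        # MTJs in a single access; modeled here as a software popcount.
        return sum(segment)

    def sc_mac(x_stream, w_stream):
        """Unipolar SC multiply (bitwise AND) followed by TR
        one-counting, so no accumulative parallel counter (APC)
        is needed."""
        product = [a & b for a, b in zip(x_stream, w_stream)]
        ones = transverse_read_count(product)
        return ones / len(product)  # binary count, activated directly

    # Example: p_x = 0.75, p_w = 0.5 -> product estimate near 0.375
    x = [1, 1, 1, 0, 1, 1, 0, 1]   # 6/8 = 0.75
    w = [1, 0, 1, 0, 1, 0, 1, 0]   # 4/8 = 0.5
    print(sc_mac(x, w))            # -> 0.375

The design choice mirrored here is that the accumulation step disappears into the memory read itself: TR returns the one-count as a binary value, which the near-memory neuron can activate directly instead of funneling the product bits through an APC.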