Today's high-performance architectures are increasingly constrained by data movement latency and energy overhead, as the slowdown of single-core performance scaling coincides with the rise of highly data-intensive workloads. In-memory architectures have emerged as a complementary solution to conventional von Neumann systems by alleviating memory bandwidth bottlenecks, exploiting massive concurrency, and mitigating excessive data movement between memory and processing units. This study proposes a parallel in-memory stochastic computing (SC) architecture that implements an end-to-end computation pipeline within Magnetic Tunnel Junction (MTJ)-based memory augmented with logic-in-memory (LIM) capabilities. By leveraging the inherent stochasticity and write-read characteristics of MTJ devices, the proposed architecture enables a fully parallel and deterministic conversion of binary operands into probabilistic bit-streams, eliminating the need for energy-intensive external random number generation circuitry. These bit-streams are processed by parallel stochastic arithmetic units integrated directly within the memory arrays, efficiently implementing core arithmetic and transcendental functions with minimal hardware complexity and inherent noise tolerance. The resulting stochastic outputs can either be reused as inputs to subsequent stochastic operations or converted back to binary form through parallel accumulation mechanisms and stored in the MTJ memory. By tightly integrating data storage, bit-stream generation, and computation within a unified in-memory fabric, the proposed design maximizes memory-level parallelism while substantially reducing data movement.
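As background for readers unfamiliar with stochastic computing, the sketch below illustrates the general SC principle the abstract relies on: a value in [0, 1] is encoded as the probability of a 1 in a random bit-stream, after which multiplication reduces to a bitwise AND and scaled addition to a multiplexer. This is a software illustration of textbook unipolar SC only; the stream generation here uses a conventional pseudorandom source, whereas the proposed architecture derives its streams from MTJ device stochasticity, and the exact encoding and arithmetic units are not specified in this abstract.

```python
import random

def to_bitstream(p, n, rng):
    # Unipolar SC encoding: each bit is 1 with probability p,
    # so the stream's mean approximates the encoded value p.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(xs, ys):
    # Bitwise AND of two independent streams yields a stream
    # whose probability of 1 is approximately p_x * p_y.
    return [a & b for a, b in zip(xs, ys)]

def sc_scaled_add(xs, ys, sel):
    # A multiplexer driven by a p=0.5 select stream computes
    # the scaled sum (p_x + p_y) / 2.
    return [x if s else y for x, y, s in zip(xs, ys, sel)]

def to_value(bits):
    # Decode a stream back to a scalar by averaging its bits
    # (the software analogue of a parallel accumulation step).
    return sum(bits) / len(bits)

rng = random.Random(0)
n = 100_000
x = to_bitstream(0.50, n, rng)
y = to_bitstream(0.25, n, rng)
sel = to_bitstream(0.50, n, rng)

prod = to_value(sc_multiply(x, y))        # ~ 0.50 * 0.25 = 0.125
ssum = to_value(sc_scaled_add(x, y, sel))  # ~ (0.50 + 0.25) / 2 = 0.375
```

Note how both operations need only a single logic gate per stream bit, which is what makes SC units cheap enough to replicate massively inside memory arrays, at the cost of accuracy that improves only with stream length.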