The dominance of machine learning and the end of Moore's law have renewed interest in Processor-in-Memory (PIM) architectures. This interest has produced several recent proposals to modify an FPGA's BRAM architecture to form a next-generation PIM reconfigurable fabric. PIM architectures can also be realized within today's FPGAs as overlays, without modifying the underlying FPGA architecture. To date, no study has examined the comparative advantages of the two approaches. In this paper, we present a study that compares two proposed custom architectures against a PIM overlay running on a commodity FPGA. We created PiCaSO, a Processor in/near Memory Scalable and Fast Overlay architecture, as a representative PIM overlay. The results of this study show that the PiCaSO overlay achieves up to 80% of the peak throughput of the custom designs, with 2.56x shorter latency and 25% to 43% better BRAM memory utilization efficiency. We then show how several key features of the PiCaSO overlay can be integrated into the custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.