As modern machine learning applications drive ever-growing heterogeneity in computing systems, pressure is increasing on memory systems to handle arbitrary and more demanding transfers efficiently. Descriptor-based direct memory access controllers (DMACs) allow such transfers to be executed by decoupling memory transfers from processing units. However, classical descriptor-based DMACs are inefficient when handling arbitrary transfers of small unit sizes: their excessive descriptor size and serialized descriptor processing lead to large static overheads when setting up transfers. To tackle this inefficiency, we propose a descriptor-based DMAC optimized to efficiently handle arbitrary transfers of small unit sizes. We implement a lightweight descriptor format in an AXI4-based DMAC. We further increase performance by implementing a low-overhead speculative descriptor prefetching scheme that incurs no additional latency penalty in the case of a misprediction. Our DMAC is integrated into a 64-bit Linux-capable RISC-V SoC and emulated on a Kintex FPGA to evaluate its performance. Compared to an off-the-shelf descriptor-based DMAC IP, we achieve 1.66x lower latency when launching transfers and increase bus utilization by up to 2.5x in an ideal memory system with 64-byte transfers, while requiring 11% fewer lookup tables, 23% fewer flip-flops, and no block RAMs. In deep memory systems, we extend our lead in bus utilization to 3.6x with 64-byte transfers. We synthesized our DMAC in GlobalFoundries' GF12LP+ node, achieving a clock frequency of over 1.44 GHz while occupying only 49.5 kGE.
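To make the idea of speculative descriptor prefetching more concrete, the following toy model compares serialized and speculative fetching of a descriptor chain. It is only an illustrative sketch, not the DMAC's actual implementation: the descriptor size (`DESC_SIZE`), the end-of-chain sentinel (`END`), the prediction rule (the next descriptor is assumed to sit directly after the current one in memory), the one-cycle issue interval, and the unlimited speculation depth are all assumptions introduced here.

```python
from dataclasses import dataclass

DESC_SIZE = 32  # assumed descriptor size in bytes (hypothetical value)
END = 0         # assumed sentinel: a next pointer of 0 ends the chain


@dataclass
class Descriptor:
    addr: int  # address at which this descriptor is stored
    next: int  # address of the next descriptor, or END


def make_chain(addrs):
    """Build a linked descriptor chain stored at the given addresses."""
    nexts = list(addrs[1:]) + [END]
    return [Descriptor(a, n) for a, n in zip(addrs, nexts)]


def fetch_cycles(chain, mem_latency, speculate):
    """Return the cycle at which the last descriptor of the chain arrives.

    Serialized: the read of descriptor i+1 is issued only once descriptor i
    has arrived and its next pointer is known (mem_latency cycles later).
    Speculative: a read of addr + DESC_SIZE is issued one cycle after the
    previous read. On a hit that read is used; on a miss it is discarded and
    the real read is issued when the next pointer arrives, which is exactly
    when the serialized scheme would have issued it.
    """
    if not chain:
        return 0
    issue = 0  # cycle at which the current descriptor's read is issued
    for prev in chain[:-1]:
        hit = speculate and prev.next == prev.addr + DESC_SIZE
        issue += 1 if hit else mem_latency
    return issue + mem_latency


contiguous = make_chain([0x1000, 0x1020, 0x1040, 0x1060])
scattered = make_chain([0x1000, 0x5000, 0x2000, 0x9000])

for name, chain in (("contiguous", contiguous), ("scattered", scattered)):
    serial = fetch_cycles(chain, mem_latency=10, speculate=False)
    spec = fetch_cycles(chain, mem_latency=10, speculate=True)
    print(f"{name}: serialized={serial} cycles, speculative={spec} cycles")
```

In this model, a chain whose descriptors are laid out contiguously benefits from overlapping the fetches, while a scattered chain falls back to the serialized timing without being slower than it, which mirrors the "no additional latency penalty on a misprediction" property in spirit. The cycle counts are properties of the toy model only and say nothing about the measured results reported above.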