Data transfers are essential in today's computing systems, where latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control-plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations. In high-performance systems, we achieve speedups of up to 15.8x with only 1 % additional area compared to a base system without a DMAE. In ultra-low-energy edge AI systems, we achieve an area reduction of 10 % while improving ML inference performance by 23 % over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide the instantiation of iDMA in various systems.
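To make the role of the mid-end concrete, the following is a minimal behavioral sketch of how a multi-dimensional transfer could be unrolled into the contiguous 1-D transfers that a back-end executes. It is an illustration under stated assumptions, not the iDMA implementation: the names `Transfer1D`, `Dim`, and `expand_nd`, as well as the descriptor layout (a base 1-D transfer plus a list of repetition counts and strides), are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, List


@dataclass(frozen=True)
class Transfer1D:
    """A contiguous copy of `length` bytes, as a back-end might execute it."""
    src: int
    dst: int
    length: int


@dataclass(frozen=True)
class Dim:
    """One outer dimension: repeat `reps` times, advancing by the strides."""
    reps: int
    src_stride: int
    dst_stride: int


def expand_nd(base: Transfer1D, dims: List[Dim]) -> Iterator[Transfer1D]:
    """Hypothetical mid-end: unroll an N-D transfer into 1-D transfers.

    dims[0] is the innermost (fastest-varying) outer dimension.
    An empty `dims` list yields the base transfer unchanged.
    """
    # itertools.product varies its last argument fastest, so the
    # dimensions are passed outermost-first.
    outer_first = list(reversed(dims))
    for idx in product(*(range(d.reps) for d in outer_first)):
        src, dst = base.src, base.dst
        for i, d in zip(idx, outer_first):
            src += i * d.src_stride
            dst += i * d.dst_stride
        yield Transfer1D(src, dst, base.length)
```

For example, `expand_nd(Transfer1D(0x1000, 0x2000, 16), [Dim(3, 64, 16)])` describes gathering three 16-byte rows from a source with a 64-byte row pitch into a densely packed destination. In hardware, this unrolling would happen in the mid-end so that the processing elements issue a single request instead of one per row.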