Data transfers are essential in today's computing systems, where latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane, binding iDMA to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations. In high-performance systems, we achieve speedups of up to 15.8x with only 1 % additional area compared to a base system without a DMAE. In ultra-low-energy edge AI systems, we reduce area by 10 % while improving ML inference performance by 23 % compared to an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide the instantiation of iDMA in various systems.
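To make the division of labor between mid-end and back-end concrete, the following is a minimal behavioral sketch in Python, not the iDMA implementation. It assumes a mid-end that expands a multi-dimensional transfer into linear (1-D) transfers and a back-end that executes each linear transfer on a flat memory. All names (`Transfer1D`, `Dim`, `expand_nd`, `run_backend`) and the descriptor fields are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Transfer1D:
    """A linear transfer as a back-end might consume it (hypothetical fields)."""
    src: int     # source byte address
    dst: int     # destination byte address
    length: int  # number of bytes to copy


@dataclass(frozen=True)
class Dim:
    """One outer dimension: repeat the inner transfer `reps` times (hypothetical)."""
    src_stride: int  # bytes added to the source address per repetition
    dst_stride: int  # bytes added to the destination address per repetition
    reps: int


def expand_nd(base: Transfer1D, dims: List[Dim]) -> Iterator[Transfer1D]:
    """Mid-end sketch: expand an N-D transfer into 1-D transfers.

    dims[0] is the innermost dimension, dims[-1] the outermost.
    """
    if not dims:
        yield base
        return
    *inner, outer = dims
    for i in range(outer.reps):
        shifted = Transfer1D(
            base.src + i * outer.src_stride,
            base.dst + i * outer.dst_stride,
            base.length,
        )
        yield from expand_nd(shifted, inner)


def run_backend(mem: bytearray, transfers: Iterable[Transfer1D]) -> None:
    """Back-end sketch: execute each linear transfer on a flat byte memory."""
    for t in transfers:
        mem[t.dst:t.dst + t.length] = mem[t.src:t.src + t.length]
```

As a usage example under the same assumptions, a 3-row by 4-byte tile can be gathered from a buffer with a row pitch of 8 bytes into a packed destination by calling `run_backend(mem, expand_nd(Transfer1D(src=2, dst=32, length=4), [Dim(src_stride=8, dst_stride=4, reps=3)]))`. Keeping the expansion separate from the copy mirrors the idea that pattern generation and data movement can be composed and customized independently.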