The growing demand for efficient, high-performance processing in machine learning (ML) and image processing has made hardware accelerators, such as GPUs and Data Streaming Accelerators (DSAs), increasingly essential. These accelerators speed up ML and image processing tasks by offloading computation from the CPU to dedicated hardware, but they depend on interconnects for efficient data transfer, making interconnect design crucial to system-level performance. This paper introduces Gem5-AcceSys, an innovative framework for system-level exploration of standard interconnects and configurable memory hierarchies. Using a matrix multiplication accelerator tailored for transformer workloads as a case study, we evaluate PCIe performance across diverse memory types (DDR4, DDR5, GDDR6, HBM2) and configurations, including host-side and device-side memory. Our findings demonstrate that optimized interconnects can achieve up to 80% of device-side memory performance and, in some scenarios, even surpass it. These results offer actionable insights for system architects, enabling a balanced approach to performance and cost in next-generation accelerator design.