Today's performance analysis frameworks for deep learning accelerators suffer from two significant limitations. First, although modern convolutional neural networks (CNNs) consist of many types of layers other than convolution, especially during training, these frameworks focus largely on convolution layers. Second, these frameworks generally target inference and lack support for training operations. This work proposes a novel performance analysis framework, SimDIT, for general ASIC-based systolic hardware accelerator platforms. The modeling effort of SimDIT comprehensively covers the convolution and non-convolution operations of both CNN inference and training on a highly parameterizable hardware substrate. SimDIT is integrated with a backend silicon implementation flow and provides detailed end-to-end performance statistics (i.e., data access cost, cycle counts, energy, and power) for executing CNN inference and training workloads. SimDIT-enabled performance analysis reveals that, on a 64×64 processing array, non-convolution operations constitute 59.5% of the total runtime of a ResNet-50 training workload. In addition, by optimally distributing the available off-chip DRAM bandwidth and on-chip SRAM resources, SimDIT achieves an 18× performance improvement over a generic static resource allocation for ResNet-50 inference.
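To illustrate why the 59.5% figure matters, the short sketch below applies Amdahl's law to it. This is not part of SimDIT and is not the authors' analysis. Only the 59.5% non-convolution share comes from the text above; the function names and the 10× convolution speedup are hypothetical values chosen for illustration.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def amdahl_upper_bound(accelerated_fraction: float) -> float:
    """Limit of the overall speedup as the accelerated part becomes infinitely fast."""
    if not 0.0 <= accelerated_fraction < 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1)")
    return 1.0 / (1.0 - accelerated_fraction)


# Share of ResNet-50 training runtime spent in non-convolution operations
# on a 64x64 array, as reported in the abstract.
NON_CONV_FRACTION = 0.595
CONV_FRACTION = 1.0 - NON_CONV_FRACTION

# Hypothetical scenario (an assumption, not a result from the paper):
# only the convolution layers are made 10x faster.
speedup_10x = amdahl_speedup(CONV_FRACTION, 10.0)

# Best case if convolution time dropped to zero.
upper_bound = amdahl_upper_bound(CONV_FRACTION)

print(f"conv 10x faster -> overall {speedup_10x:.2f}x")
print(f"conv infinitely faster -> overall at most {upper_bound:.2f}x")
```

Under these assumptions, optimizing convolution alone cannot speed up the training workload by more than about 1.68×, which is consistent with the abstract's argument that a framework modeling only convolution layers misses most of the training runtime.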