Specialized hardware accelerators are widely used for Deep Neural Networks (DNNs) to improve power efficiency and performance. These accelerators contain specialized hardware that supports DNN operators and a scratchpad memory that stores the tensor operands. Often, the scratchpad is too small to hold all the tensors needed for the computation, so additional data accesses are needed to move tensors back and forth between the scratchpad and host memory during the computation, at a significant power and performance cost. The volume of these additional data accesses depends on the operator schedule and on the memory allocation (the specific locations selected for the tensors in the scratchpad). We propose an optimization framework, named COSMA, for mapping DNNs to an accelerator; it finds the optimal operator schedule, memory allocation, and tensor replacement that minimize the additional data accesses. COSMA provides an Integer Linear Programming (ILP) formulation that generates the optimal solution for mapping a DNN to the accelerator for a given scratchpad size. We demonstrate that, using an off-the-shelf ILP solver, COSMA obtains the optimal solution in seconds for a wide range of state-of-the-art DNNs across different applications. Further, it outperforms existing methods, reducing non-compulsory data accesses by 84% on average. We also propose a divide-and-conquer heuristic that scales to certain complex DNNs generated by Neural Architecture Search; this heuristic solution reduces data accesses by 85% on average compared with other works.
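To illustrate why the operator schedule matters, the following is a minimal sketch and not COSMA's ILP formulation. It assumes a hypothetical six-operator graph with made-up tensor sizes, treats a tensor as resident from the step that produces it until its last consumer has run, and brute-forces every valid schedule to find the one with the smallest peak scratchpad footprint. The names `GRAPH`, `is_valid`, `peak_footprint`, and `best_schedule` are illustrative only.

```python
from itertools import permutations

# Hypothetical toy graph (an assumption for illustration, not from the paper):
# op name -> (size of its output tensor, list of ops whose outputs it reads).
GRAPH = {
    "x":   (4, []),
    "a1":  (8, ["x"]),
    "a2":  (1, ["a1"]),
    "b1":  (8, ["x"]),
    "b2":  (1, ["b1"]),
    "out": (1, ["a2", "b2"]),
}


def is_valid(schedule):
    """Return True if every op runs after all of its producers."""
    seen = set()
    for op in schedule:
        if any(src not in seen for src in GRAPH[op][1]):
            return False
        seen.add(op)
    return True


def peak_footprint(schedule):
    """Peak total size of resident tensors when ops run in the given order.

    A tensor becomes resident when its producer runs and is freed right
    after its last consumer has run. Tensors with no consumer (the final
    output here) stay resident until the end.
    """
    pending = {op: 0 for op in GRAPH}  # consumers that have not run yet
    for _, inputs in GRAPH.values():
        for src in inputs:
            pending[src] += 1

    live, peak = set(), 0
    for op in schedule:
        live.add(op)  # inputs and output are all resident during the op
        peak = max(peak, sum(GRAPH[t][0] for t in live))
        for src in GRAPH[op][1]:
            pending[src] -= 1
            if pending[src] == 0:
                live.discard(src)
    return peak


def best_schedule():
    """Exhaustively search all valid schedules (only feasible for tiny graphs)."""
    valid = [s for s in permutations(GRAPH) if is_valid(s)]
    best = min(valid, key=peak_footprint)
    return peak_footprint(best), best


if __name__ == "__main__":
    depth_first = ["x", "a1", "a2", "b1", "b2", "out"]
    breadth_first = ["x", "a1", "b1", "a2", "b2", "out"]
    print("depth-first peak:  ", peak_footprint(depth_first))
    print("breadth-first peak:", peak_footprint(breadth_first))
    print("best peak, schedule:", best_schedule())
```

Under these assumed sizes, a scratchpad of capacity 16 holds every resident tensor for the depth-first order but not for the breadth-first order, so only the latter would need extra transfers to and from host memory. Exhaustive search grows factorially with the number of operators, which is one reason a solver-based formulation such as the ILP described above is attractive for real networks.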