Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization

Memory is a critical design consideration in current data-intensive DNN accelerators because it strongly affects energy consumption, bandwidth requirements, and area costs. As DNN structures become more complex, a larger on-chip memory capacity is required to reduce data movement overhead, but at the expense of silicon costs. Previous works have proposed memory-oriented optimizations, such as different data reuse and layer fusion schemes. However, these methods are neither general nor effective enough to cope with diverse graph structures. In this paper, we explore the intrinsic connection between network structures and memory features to optimize both hardware and mapping. First, we introduce a graph-level execution scheme with a corresponding dataflow and memory management method. This scheme enables the execution of arbitrary graph patterns with high data reuse and low hardware overhead. Second, we propose Cocco, a hardware-mapping co-exploration framework that leverages graph-level features of networks. It aims to minimize communication overhead, such as energy consumption and bandwidth requirements, with a smaller memory capacity. We formulate graph-partition scheduling and memory configuration search as an optimization problem and employ a genetic-algorithm-based method to achieve efficient co-exploration for large and irregular networks. Experiments demonstrate that Cocco obtains lower external memory access, lower bandwidth requirements, and more stable optimization for graph partition than the greedy algorithm and dynamic programming introduced in prior works. Through co-exploration, Cocco also reduces costs by 1.89% to 50.33% compared to other typical methods.
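To make the co-exploration idea concrete, the following is a minimal sketch of a genetic search over a partition and a memory capacity at the same time. It is not Cocco's formulation: the workload is simplified to a linear chain of layers rather than an arbitrary graph, and the layer sizes, the candidate capacities, the weighting factor `ALPHA`, the cost model, and all function names are assumptions chosen purely for illustration.

```python
import random

# Hypothetical toy workload: (weight_size, output_activation_size) per layer, in KB.
LAYERS = [(4, 64), (8, 64), (16, 32), (32, 32), (64, 16), (64, 16), (128, 8), (128, 8)]
INPUT_SIZE = 96                         # KB of network input (assumed)
CAPACITIES = [64, 128, 256, 512, 1024]  # candidate on-chip buffer sizes in KB (assumed)
ALPHA = 0.5                             # assumed weight of capacity vs. external access


def subgraphs(cuts):
    """Split layer indices into contiguous groups; cuts[i] cuts between layers i and i+1."""
    groups, start = [], 0
    for i, cut in enumerate(cuts):
        if cut:
            groups.append(range(start, i + 1))
            start = i + 1
    groups.append(range(start, len(LAYERS)))
    return groups


def cost(genome):
    """External traffic plus a capacity term; infeasible genomes cost infinity."""
    cuts, cap_idx = genome
    capacity = CAPACITIES[cap_idx]
    traffic = 0
    for group in subgraphs(cuts):
        first, last = group[0], group[-1]
        weights = sum(LAYERS[i][0] for i in group)
        # Activations passed between fused layers stay on chip.
        internal = sum(LAYERS[i][1] for i in group if i != last)
        if weights + internal > capacity:
            return float("inf")
        in_size = INPUT_SIZE if first == 0 else LAYERS[first - 1][1]
        traffic += weights + in_size + LAYERS[last][1]
    return traffic + ALPHA * capacity


def mutate(genome, rng):
    """Flip one cut point or pick a new capacity."""
    cuts, cap_idx = list(genome[0]), genome[1]
    if rng.random() < 0.5:
        i = rng.randrange(len(cuts))
        cuts[i] = not cuts[i]
    else:
        cap_idx = rng.randrange(len(CAPACITIES))
    return (tuple(cuts), cap_idx)


def crossover(a, b, rng):
    """One-point crossover on the cut vector; capacity inherited from either parent."""
    point = rng.randrange(1, len(a[0]))
    return (a[0][:point] + b[0][point:], rng.choice([a[1], b[1]]))


def search(generations=60, pop_size=24, seed=0):
    rng = random.Random(seed)
    n_cuts = len(LAYERS) - 1
    # Seed the population with a layer-by-layer baseline at the largest capacity.
    baseline = ((True,) * n_cuts, len(CAPACITIES) - 1)
    pop = [baseline] + [
        (tuple(rng.random() < 0.5 for _ in range(n_cuts)), rng.randrange(len(CAPACITIES)))
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]  # truncation selection; the best genome survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return min(pop, key=cost)
```

Calling `best = search()` followed by `print(best, cost(best))` shows the selected cut points and capacity index. Because the better half of each generation is carried over unchanged, the returned genome is never worse than the layer-by-layer baseline under this toy cost model.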
