Network-on-Chip (NoC) based architectures have recently been proposed to accelerate deep neural networks in specialized hardware. Because the hardware configuration is fixed after manufacture, proper task mapping has attracted researchers' interest. We propose a travel time-based task mapping method that allocates uneven numbers of tasks to different Processing Elements (PEs). The approach uses the travel time recorded in a sampling window and thereby implicitly exploits both static NoC architecture information and dynamic NoC congestion status. We further examine the effectiveness of our method under various configurations, including different mapping iterations, flit sizes, and NoC architectures. For a single layer, our method achieves up to 12.1% improvement over even mapping and static distance mapping. For a complete NN example, our method achieves overall improvements of 10.37% and 13.75% over row-major mapping and distance-based mapping, respectively. While ideal travel time-based mapping (post-run) achieves a 10.37% overall improvement over row-major mapping, we adopt a sampling window to map tasks efficiently at run time, achieving an 8.17% improvement (sampling window of 10).
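The idea of allocating uneven task counts from sampled travel times can be illustrated with a minimal sketch. The abstract does not state the exact allocation rule, so the inverse-proportional weighting, the largest-remainder rounding, and the name `allocate_tasks` are all assumptions made for illustration, not the authors' implementation. The input is taken to be one average travel time per PE, as measured during the sampling window.

```python
from typing import List, Sequence


def allocate_tasks(travel_times: Sequence[float], total_tasks: int) -> List[int]:
    """Split total_tasks across PEs in inverse proportion to travel time.

    Hypothetical helper: the weighting rule is an assumption, not taken
    from the paper. travel_times[i] is the average travel time sampled
    for PE i; a PE with a shorter travel time receives more tasks.
    """
    if not travel_times or any(t <= 0 for t in travel_times):
        raise ValueError("travel times must be positive")
    if total_tasks < 0:
        raise ValueError("total_tasks must be non-negative")

    # Assumed rule: weight each PE by the reciprocal of its travel time.
    weights = [1.0 / t for t in travel_times]
    scale = total_tasks / sum(weights)
    ideal = [w * scale for w in weights]

    # Round down, then give the leftover tasks to the PEs with the
    # largest fractional parts so the counts sum to total_tasks.
    counts = [int(x) for x in ideal]
    leftover = total_tasks - sum(counts)
    by_fraction = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

Under these assumptions, `allocate_tasks([1.0, 2.0, 2.0], 8)` returns `[4, 2, 2]`: the PE whose sampled travel time is half that of the others receives twice as many tasks. Because the measured travel time already reflects both hop distance and congestion, a rule of this kind needs no separate model of the NoC topology.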