Efficient deployment of deep learning solutions on the edge has recently received increasing attention, and new platforms are emerging to meet the growing demand for flexibility and high performance. In this work, we explore the efficient mapping of convolutional layers onto an open-hardware, low-power Coarse-Grain Reconfigurable Array (CGRA), namely OpenEdgeCGRA. We explore both direct implementations of convolution and solutions that transform it into a matrix multiplication through an Im2col transformation, and we experiment with parallelism along various tensor axes. We show that, for this hardware target, direct convolution coupled with weight parallelism reaches the best latency and energy efficiency, outperforming a CPU implementation by 3.4x and 9.9x in terms of energy and latency, respectively.
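To make the two formulations named in the abstract concrete, the following is a minimal pure-Python sketch, not the OpenEdgeCGRA mapping itself. It assumes a `[C][H][W]` input layout, a `[K][C][R][S]` weight layout, stride 1, and no padding; the function names `direct_conv`, `im2col`, and `im2col_conv` are hypothetical and chosen only for illustration. It shows that a direct convolution and an Im2col transformation followed by a matrix multiplication compute the same output.

```python
def direct_conv(x, w):
    """Direct convolution. x: [C][H][W], w: [K][C][R][S] (assumed layouts).
    Stride 1 and no padding are assumptions of this sketch."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    Ho, Wo = H - R + 1, W - S + 1
    out = [[[0] * Wo for _ in range(Ho)] for _ in range(K)]
    for k in range(K):                      # output channels
        for i in range(Ho):                 # output rows
            for j in range(Wo):             # output columns
                acc = 0
                for c in range(C):          # input channels
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter columns
                            acc += x[c][i + r][j + s] * w[k][c][r][s]
                out[k][i][j] = acc
    return out


def im2col(x, R, S):
    """Unroll each RxS patch (across all channels) into a vector of length C*R*S.
    Returns one vector per output pixel, in row-major order, plus Ho and Wo."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    Ho, Wo = H - R + 1, W - S + 1
    cols = [
        [x[c][i + r][j + s] for c in range(C) for r in range(R) for s in range(S)]
        for i in range(Ho)
        for j in range(Wo)
    ]
    return cols, Ho, Wo


def im2col_conv(x, w):
    """Convolution expressed as a matrix multiplication over the Im2col buffer."""
    K, C = len(w), len(w[0])
    R, S = len(w[0][0]), len(w[0][0][0])
    cols, Ho, Wo = im2col(x, R, S)
    # Flatten each filter in the same (c, r, s) order used by im2col.
    wmat = [
        [w[k][c][r][s] for c in range(C) for r in range(R) for s in range(S)]
        for k in range(K)
    ]
    out = [[[0] * Wo for _ in range(Ho)] for _ in range(K)]
    for k in range(K):
        for p, col in enumerate(cols):
            out[k][p // Wo][p % Wo] = sum(a * b for a, b in zip(wmat[k], col))
    return out


# Small example: one 3x3 input channel, one 2x2 filter.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [[[[1, 0], [0, 1]]]]
print(direct_conv(x, w) == im2col_conv(x, w))
```

In this sketch the Im2col buffer holds `Ho * Wo` vectors of `C * R * S` elements each, so overlapping input pixels are stored several times, whereas the direct form reads them in place from the original tensor. This is a general property of the two formulations and is not a claim about the measurements reported above.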