In this paper, we propose alternative strategies for segmenting convolutional neural networks (CNNs), addressing inference on computing architectures composed of multiple Edge TPUs. Specifically, we compare the inference performance of a number of state-of-the-art CNN models, taking as references the inference times on a single TPU and the compiler-based pipelined inference implementation provided by the Google Edge TPU compiler. Starting from a profile-based segmentation strategy, we introduce further refinements to balance the workload across multiple TPUs, leveraging their cooperative computing power, reducing work imbalance, and alleviating the memory-access bottleneck caused by the limited amount of on-chip memory per TPU. The observed results yield superlinear speedups with respect to a single TPU, and accelerations of up to 2.60x compared with the multi-TPU segmentation produced by the compiler.
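The profile-based segmentation idea can be illustrated with a minimal sketch: given profiled per-layer inference times, split the layer sequence into contiguous segments (one per TPU in the pipeline) so that the slowest segment, which bounds pipeline throughput, is as small as possible. The function name, the per-layer times, and the binary-search formulation below are illustrative assumptions, not the paper's actual algorithm.

```python
def partition_layers(costs, k):
    """Split per-layer costs into k contiguous segments, minimizing the
    largest segment total (the pipeline bottleneck stage), via binary
    search on the feasible bottleneck value."""
    def segments_needed(limit):
        # Greedily pack layers into segments not exceeding `limit`.
        count, acc = 1, 0.0
        for c in costs:
            if acc + c > limit:
                count, acc = count + 1, c
            else:
                acc += c
        return count

    lo, hi = max(costs), sum(costs)
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if segments_needed(mid) <= k:
            hi = mid  # feasible: try a smaller bottleneck
        else:
            lo = mid  # infeasible: bottleneck must grow

    # Rebuild the segment boundaries achieving the found bottleneck.
    bounds, acc, start = [], 0.0, 0
    for i, c in enumerate(costs):
        if acc + c > hi:
            bounds.append((start, i))
            start, acc = i, c
        else:
            acc += c
    bounds.append((start, len(costs)))
    return bounds

# Hypothetical profiled per-layer times (ms), segmented over 3 TPUs.
times = [1.0, 2.0, 1.5, 3.0, 0.5, 2.5, 1.0]
print(partition_layers(times, 3))  # e.g. [(0, 3), (3, 5), (5, 7)]
```

In practice, the refinements described above would additionally weight such a partition by each segment's parameter footprint, so that every segment fits in the limited on-chip memory of its TPU.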