In this paper, we systematically evaluate the inference performance of Google's Edge TPU for neural networks with different characteristics. Specifically, we determine that, given the limited amount of on-chip memory on the Edge TPU, accesses to external (host) memory quickly become a major performance bottleneck. We demonstrate how multiple devices can be used jointly to alleviate the bottleneck introduced by host-memory accesses. We propose a solution that combines model segmentation and pipelining on up to four TPUs, yielding remarkable performance improvements that range from $6\times$ for neural networks with convolutional layers to $46\times$ for fully connected layers, compared with single-TPU setups.
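The segmentation-plus-pipelining idea above can be illustrated with a minimal sketch. This is a hypothetical simulation, not the authors' implementation or the Edge TPU runtime API: each stage stands in for a model segment compiled onto one device, worker threads run stages concurrently, and queues hand intermediate activations between them so consecutive inputs overlap in time.

```python
import queue
import threading

# Hypothetical stand-ins for model segments, one per (simulated) TPU.
segments = [
    lambda x: [v * 2 for v in x],   # segment 0 on device 0
    lambda x: [v + 1 for v in x],   # segment 1 on device 1
    lambda x: [v * v for v in x],   # segment 2 on device 2
]

def pipeline(segments, inputs):
    """Run each segment in its own worker thread; queues pass
    activations between stages so different inputs can occupy
    different stages at the same time."""
    qs = [queue.Queue() for _ in range(len(segments) + 1)]

    def worker(seg, q_in, q_out):
        while True:
            x = q_in.get()
            if x is None:           # sentinel: propagate shutdown
                q_out.put(None)
                break
            q_out.put(seg(x))

    threads = [
        threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
        for i, s in enumerate(segments)
    ]
    for t in threads:
        t.start()
    for x in inputs:
        qs[0].put(x)
    qs[0].put(None)

    outputs = []
    while True:
        y = qs[-1].get()
        if y is None:
            break
        outputs.append(y)
    for t in threads:
        t.join()
    return outputs

print(pipeline(segments, [[1, 2], [3, 4]]))  # → [[9, 25], [49, 81]]
```

Because each stage only holds one segment's parameters, the per-device working set shrinks, which is the mechanism by which segmentation relieves pressure on host memory in the single-device case.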