Tensor processing units (TPUs), specialized hardware accelerators for machine learning, deliver significant performance improvements when executing the convolutional layers of convolutional neural networks (CNNs). However, they struggle to maintain the same efficiency in fully connected (FC) layers, which leads to suboptimal hardware utilization. In-memory analog computing (IMAC) architectures, on the other hand, have demonstrated notable speedups in executing FC layers. This paper introduces a novel heterogeneous, mixed-signal, mixed-precision architecture that integrates an IMAC unit with an edge TPU to enhance mobile CNN performance. To leverage the strengths of TPUs for convolutional layers and of IMAC circuits for dense layers, we propose a unified learning algorithm that incorporates mixed-precision training techniques to mitigate potential accuracy drops when models are deployed on the TPU-IMAC architecture. Simulations demonstrate that, compared to conventional TPU architectures, the TPU-IMAC configuration achieves up to $2.59\times$ performance improvement and $88\%$ memory reduction for various CNN models while maintaining comparable accuracy. The TPU-IMAC architecture shows potential for applications where energy efficiency and high performance are essential, such as edge computing and real-time processing on mobile devices. Together, the unified training algorithm and the integration of the IMAC and TPU architectures underpin the potential impact of this research on the broader machine learning landscape.
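The layer-to-hardware split described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Layer` type, the helper names, the toy model, and the bit widths (`TPU_BITS = 8`, `IMAC_BITS = 2`) are all assumptions made for illustration, since the abstract does not state the precisions used. The sketch only shows the routing idea (convolutional layers to the TPU, FC layers to the IMAC unit) and how a mixed-precision split changes the weight storage footprint.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    name: str
    kind: str    # "conv" or "fc"
    params: int  # number of weights in the layer


# Hypothetical bit widths, chosen only for illustration (not from the paper).
TPU_BITS = 8   # assumed integer precision on the digital TPU path
IMAC_BITS = 2  # assumed low-precision weights on the analog IMAC path


def assign_unit(layer: Layer) -> str:
    """Route FC (dense) layers to the IMAC unit and everything else to the TPU."""
    return "IMAC" if layer.kind == "fc" else "TPU"


def partition(layers: List[Layer]) -> Dict[str, str]:
    """Map each layer name to the unit that would execute it."""
    return {layer.name: assign_unit(layer) for layer in layers}


def weight_bits(layers: List[Layer], heterogeneous: bool) -> int:
    """Total weight storage in bits, for a TPU-only or a TPU-IMAC deployment."""
    total = 0
    for layer in layers:
        on_imac = heterogeneous and assign_unit(layer) == "IMAC"
        total += layer.params * (IMAC_BITS if on_imac else TPU_BITS)
    return total


# Toy model with made-up parameter counts.
toy_model = [
    Layer("conv1", "conv", 1_000),
    Layer("conv2", "conv", 2_000),
    Layer("fc1", "fc", 10_000),
    Layer("fc2", "fc", 1_000),
]

if __name__ == "__main__":
    print(partition(toy_model))
    tpu_only = weight_bits(toy_model, heterogeneous=False)
    tpu_imac = weight_bits(toy_model, heterogeneous=True)
    print(f"TPU-only weights: {tpu_only} bits, TPU-IMAC weights: {tpu_imac} bits")
```

The toy numbers are not meant to reproduce the reported $2.59\times$ or $88\%$ figures; they only show why FC-heavy models would benefit most from moving dense layers to a lower-precision unit under these assumptions.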