FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training

On-device training of deep neural networks (DNNs) is in growing demand because it can leverage personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained platforms face significant challenges due to the intensive computational and memory demands of DNN training. Tensor decomposition is a promising approach to compressing model size without sacrificing accuracy. Nevertheless, training tensorized neural networks (TNNs) incurs non-trivial overhead and severe performance degradation on conventional accelerators because of its complex tensor shaping requirements. To address these challenges, we propose FETTA, an algorithm and hardware co-optimization framework for efficient TNN training. On the algorithm side, we develop a contraction sequence search engine (CSSE) that identifies the optimal contraction sequence, i.e., the one with minimal computational overhead. On the hardware side, FETTA features a flexible and efficient architecture equipped with a reconfigurable contraction engine (CE) array to support diverse dataflows. Furthermore, butterfly-based distribution and reduction networks perform flexible tensor shaping operations during computation. Evaluation results demonstrate that FETTA achieves reductions of 20.5x/100.9x, 567.5x/45.03x, and 11609.7x/4544.8x in processing latency, energy, and energy-delay product (EDP) over GPU/TPU, respectively. Moreover, on tensorized training, FETTA outperforms prior accelerators with a speedup of 3.87x to 14.63x and an energy efficiency improvement of 1.41x to 2.73x on average.
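To illustrate why the contraction sequence matters, the following is a minimal sketch of an exhaustive search over pairwise contraction orders. It is not the CSSE described above: the function name `best_sequence`, the cost model (the product of all index sizes involved in a pairwise contraction, as a rough multiply-accumulate count), and the example layer shapes are all assumptions made for illustration only.

```python
from itertools import combinations


def best_sequence(tensors, dims, output):
    """Exhaustively search pairwise contraction orders (hypothetical helper).

    tensors: list of frozensets of index labels, one per tensor
    dims:    dict mapping each index label to its size
    output:  iterable of index labels that must survive to the end
    Returns (min_cost, sequence), where sequence lists the contracted pairs.
    """
    def search(ts):
        if len(ts) == 1:
            return 0, []
        best = (float("inf"), None)
        for i, j in combinations(range(len(ts)), 2):
            a, b = ts[i], ts[j]
            rest = [t for k, t in enumerate(ts) if k not in (i, j)]
            # An index survives if the output or a remaining tensor needs it.
            needed = set(output).union(*rest)
            # Assumed cost model: one MAC per point of the joint index space.
            cost = 1
            for idx in a | b:
                cost *= dims[idx]
            result = frozenset(idx for idx in a | b if idx in needed)
            sub_cost, sub_seq = search(rest + [result])
            total = cost + sub_cost
            if total < best[0]:
                best = (total, [(sorted(a), sorted(b))] + sub_seq)
        return best

    return search(list(tensors))


# Assumed example: y[b, o] = sum over i, r of x[b, i] * A[i, r] * B[r, o]
dims = {"b": 32, "i": 1024, "r": 8, "o": 1024}
tensors = [frozenset("bi"), frozenset("ir"), frozenset("ro")]
cost, seq = best_sequence(tensors, dims, output="bo")
print(cost, seq)
```

In this toy case, contracting `x` with `A` first keeps every intermediate small, whereas contracting `A` with `B` first rebuilds the full `i`-by-`o` weight matrix and costs far more under the same cost model. Exhaustive search grows factorially with the number of tensors, so it is only practical for small networks such as this one.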
