Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensity, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on the fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves an energy efficiency of 16.9 TOPS/W, competitive with state-of-the-art transformer accelerators, while outperforming them in area efficiency with 5.93 TOPS/mm$^2$ in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
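To make the idea of an integer-only softmax computed in streaming mode more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not ITA's actual datapath: the base-2 exponent approximation, the constants `SHIFT_BITS` and `SUM_ONE`, and the function name `streaming_int_softmax` are all hypothetical choices made for this example. The sketch keeps a running maximum and a running denominator while chunks of 8-bit logits arrive, rescales the denominator with a right shift whenever the maximum grows, and then normalizes with integer arithmetic only.

```python
from typing import List, Sequence

# Assumption: a difference of 2**SHIFT_BITS integer steps halves a term's weight,
# so exp() is replaced by a power of two computed with a right shift.
SHIFT_BITS = 5
# Assumption: fixed-point weight contributed by the current maximum element.
SUM_ONE = 1 << 8


def streaming_int_softmax(chunks: Sequence[Sequence[int]], out_bits: int = 8) -> List[int]:
    """Integer-only softmax approximation over non-empty chunks of int8 logits."""
    running_max = -128  # smallest int8 value
    running_sum = 0     # fixed-point denominator accumulated so far

    # Pass 1: accumulate the denominator while the chunks stream in.
    for chunk in chunks:
        chunk_max = max(chunk)
        if chunk_max > running_max:
            # The reference maximum grew: scale down what was accumulated so far.
            # The shift is floored, so this rescaling is itself approximate.
            running_sum >>= (chunk_max - running_max) >> SHIFT_BITS
            running_max = chunk_max
        for x in chunk:
            running_sum += SUM_ONE >> ((running_max - x) >> SHIFT_BITS)

    # Pass 2: normalize each element to an unsigned out_bits-wide integer.
    out_max = (1 << out_bits) - 1
    result = []
    for chunk in chunks:
        for x in chunk:
            numerator = SUM_ONE >> ((running_max - x) >> SHIFT_BITS)
            result.append((numerator * out_max) // running_sum)
    return result
```

With these assumed constants, `streaming_int_softmax([[0, 32], [64]])` returns `[36, 72, 145]`: each step of 32 in the input roughly doubles the output, and the outputs sum to slightly less than 255 because every division is floored. Only comparisons, additions, shifts, one multiplication, and one integer division per element are used, which is the property the abstract attributes to an integer-only softmax. A hardware design would likely replace the division with a cheaper operation.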