The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention. Beyond its non-linearity, Softmax's low arithmetic intensity significantly limits processing parallelism, especially for longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax. ConSmax uses differentiable normalization parameters to eliminate the maximum search and denominator summation in Softmax, enabling extensive parallelization while preserving Softmax's essential function. Moreover, a scalable ConSmax hardware design with a bitwidth-split look-up table (LUT) achieves lossless non-linear operations and supports mixed-precision computing. Experimental results show that ConSmax consumes only 0.2mW of power and 0.0008mm^2 of area at a 1250MHz working frequency in 16nm FinFET technology. As an open-source contribution, we further implement our design with the OpenROAD toolchain in SkyWater's 130nm CMOS technology, where the corresponding power is 2.69mW and the area is 0.007mm^2. ConSmax achieves 3.35x power savings and 2.75x area savings in 16nm technology, and 3.15x power savings and 4.14x area savings with the open-source EDA toolchain. Meanwhile, it maintains comparable accuracy on the GPT-2 model and the WikiText103 dataset. The project is available at https://github.com/ReaLLMASIC/ConSmax
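A minimal sketch of the idea, assuming ConSmax takes the form exp(x − β) / γ with learnable scalars β and γ standing in for the per-row maximum and the denominator sum (the parameter names and this exact formulation are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    # Standard Softmax: requires a row-wise max search and a denominator sum,
    # two reductions that serialize the computation.
    x_max = np.max(x, axis=-1, keepdims=True)
    e = np.exp(x - x_max)
    return e / np.sum(e, axis=-1, keepdims=True)

def consmax(x, beta, gamma):
    # ConSmax sketch: the learned constant beta replaces the max search and
    # gamma replaces the denominator sum, so every element can be computed
    # independently with no reduction across the row.
    return np.exp(x - beta) / gamma

scores = np.array([1.0, 2.0, 3.0])
# If beta/gamma happen to equal the true max and sum, ConSmax matches Softmax
# exactly; during training they are learned rather than computed per row.
beta = 3.0
gamma = np.sum(np.exp(scores - beta))
print(np.allclose(consmax(scores, beta, gamma), softmax(scores)))  # True
```

Because β and γ are fixed at inference time, each exponential is independent, which is what permits the parallel, LUT-based hardware mapping described above.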