Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success is partially attributable to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the $\textbf{optimal approximation rate}$, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance in terms of both accuracy and convergence rate. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.
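To make the idea of a polynomial composition activation concrete, here is a minimal NumPy sketch. It assumes one plausible instantiation: a polynomial applied to the output of a base activation such as ReLU, with learnable coefficients (fixed here for illustration). The function name `poly_relu` and the specific polynomial form are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def relu(x):
    """Standard ReLU, used here as the base activation being composed."""
    return np.maximum(x, 0.0)

def poly_relu(x, coeffs):
    """Illustrative polynomial composition activation (assumed form):
    sum_i coeffs[i] * relu(x)**i. In a real network the coefficients
    would be learnable parameters, giving the layer access to
    higher-order interactions in its inputs."""
    r = relu(x)
    return sum(c * r**i for i, c in enumerate(coeffs))

# Example: coefficients [0, 1, 0.5] give relu(x) + 0.5 * relu(x)^2
out = poly_relu(np.array([-1.0, 2.0]), [0.0, 1.0, 0.5])
```

For the negative input the ReLU zeroes the signal and every polynomial term vanishes; for the positive input the quadratic term adds curvature that a plain ReLU cannot express.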