Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants such as GELU and SiLU. These are self-gated activation functions, in which the range of the gating function lies between zero and one. In this paper, we explore the viability of arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, we introduce a trainable parameter for every MLP block that expands the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of the Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU), and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLUs).
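To make the idea concrete, here is a minimal sketch of an arctan-gated unit with an expanded gating range. The exact parameterization is an assumption for illustration (the gate is affinely rescaled by a scalar `alpha` so its range becomes `(-alpha, 1 + alpha)`); in practice `alpha` would be a trainable parameter shared within each MLP block, whereas here it is a plain float.

```python
import math


def atan_gate(x: float) -> float:
    """Arctan-based gating function with range (0, 1)."""
    return math.atan(x) / math.pi + 0.5


def xatlu(x: float, alpha: float = 0.0) -> float:
    """Sketch of an Expanded ArcTan Linear Unit: x times an expanded gate.

    The base gate maps into (0, 1); rescaling it by (1 + 2*alpha) and
    shifting by -alpha widens its range to (-alpha, 1 + alpha). With
    alpha = 0 this reduces to plain arctan gating.
    """
    gate = atan_gate(x) * (1.0 + 2.0 * alpha) - alpha
    return x * gate
```

With `alpha = 0` the function behaves like an ordinary self-gated unit (zero at the origin, approximately the identity for large positive inputs); a nonzero `alpha` lets training push the gate outside `(0, 1)`.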