Activation functions play a crucial role in introducing non-linearities to deep neural networks. We propose a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding functions through integration. Our work introduces the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the ELU activation function. xIELU combines two key gradient properties: a trainable and linearly increasing gradient for positive inputs, similar to ReLU$^2$, and a trainable negative gradient flow for negative inputs, akin to xSiLU. Conceptually, xIELU can be viewed as extending ReLU$^2$ to effectively handle negative inputs. In experiments with 1.1B parameter Llama models trained on 126B tokens of FineWeb Edu, xIELU achieves lower perplexity than both ReLU$^2$ and SwiGLU at matched compute cost and parameter count.
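To make the gradient-first construction concrete, the following is a minimal NumPy sketch of one plausible xIELU-style instantiation consistent with the description above, not the paper's exact parameterization. It assumes the gradient is $\alpha_p x + \beta$ for positive inputs (linearly increasing, ReLU$^2$-like) and $\alpha_n(e^x - 1) + \beta$ for non-positive inputs (an affine transform of the ELU gradient), then integrates each piece with constants chosen so the function is continuous with $f(0) = 0$; the parameter names $\alpha_p$, $\alpha_n$, $\beta$ are illustrative.

```python
import numpy as np

def xielu(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Sketch of an xIELU-style activation (hypothetical parameterization).

    Obtained by integrating a piecewise gradient:
      f'(x) = alpha_p * x + beta             for x > 0
      f'(x) = alpha_n * (exp(x) - 1) + beta  for x <= 0
    Integration constants are set so f is continuous and f(0) = 0.
    In a network, alpha_p, alpha_n, beta would be trainable parameters.
    """
    x = np.asarray(x, dtype=np.float64)
    # Positive branch: integral of alpha_p * x + beta
    pos = 0.5 * alpha_p * x**2 + beta * x
    # Negative branch: integral of alpha_n * (exp(x) - 1) + beta, shifted by -alpha_n
    neg = alpha_n * (np.exp(x) - 1.0) - alpha_n * x + beta * x
    return np.where(x > 0, pos, neg)
```

Differentiating each branch recovers the assumed gradients, which can be checked numerically with finite differences; in a training framework the same definition would be expressed with autograd-tracked parameters instead of fixed floats.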