Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions via integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the Exponential Linear Unit (ELU). xIELU combines two key properties for the gradient: (1) a trainable and linearly increasing gradient for positive inputs, similar to Squared ReLU (ReLU$^2$), and (2) a trainable gradient that can take negative values for negative inputs, inspired by Expanded SiLU (xSiLU). Conceptually, xIELU can be viewed as an extension of ReLU$^2$ to handle negative inputs. The trainable parameters in xIELU allow it to adaptively reduce its nonlinearity for higher-level representations deeper in the network. In experiments with 1.1B and 3B parameter Llama models trained on 125B tokens of FineWeb Edu, xIELU achieves lower perplexity than popular activation functions such as ReLU$^2$ and SwiGLU when matched for compute cost and parameter count. A reference implementation is available at https://github.com/Anonymous5823/xielu.
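The construction described above can be sketched directly: pick a gradient, then integrate it piecewise with the constant of integration chosen so the function is continuous at zero. The parametrization below (scalars `alpha_p`, `alpha_n`, `beta`) is an illustrative assumption, not the authors' exact formulation; consult the linked reference implementation for the actual parameterization and per-channel details.

```python
import math

def xielu(x, alpha_p=1.0, alpha_n=1.0, beta=0.5):
    """Illustrative sketch of xIELU: the integral of a trainable affine
    transform of ELU (parametrization assumed for illustration).

    Assumed gradient:
        x > 0:  alpha_p * x + beta             (linearly increasing, ReLU^2-like)
        x <= 0: alpha_n * (exp(x) - 1) + beta  (negative when alpha_n > beta
                                                and x is sufficiently negative)
    Integrating each piece, with the constant chosen so xielu(0) == 0:
        x > 0:  0.5 * alpha_p * x**2 + beta * x
        x <= 0: alpha_n * (exp(x) - x - 1) + beta * x
    """
    if x > 0:
        return 0.5 * alpha_p * x * x + beta * x
    return alpha_n * (math.exp(x) - x - 1.0) + beta * x
```

In practice `alpha_p` and `alpha_n` would be learned parameters (e.g. one per channel), which is what lets the function adaptively reduce its nonlinearity in deeper layers.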