Activation functions are an integral part of nearly all neural networks, from traditional architectures such as Convolutional Neural Networks to recent ones such as Transformers and Extended LSTM (xLSTM). They enable more effective training and allow networks to capture nonlinear patterns in data. More than 400 activation functions, with fixed or trainable parameters, have been proposed over the last 30 years, yet only a few are widely used. ReLU remains one of the most common, with GELU and Swish variants appearing increasingly often. However, ReLU has non-differentiable points and can suffer from exploding gradients, while GELU and Swish variants produce varying results under different parameter settings and require additional parameters to adapt to each dataset and architecture. This article introduces Zorro, a novel family of activation functions: a continuously differentiable and flexible set comprising five main functions that fuse ReLU and Sigmoid. Zorro functions are smooth and adaptable and act as information gates, matching ReLU on the 0-1 range while offering an alternative that avoids the need for normalization and the risks of neuron death and gradient explosion. Zorro also approximates functions such as Swish, GELU, and DGELU, and provides parameters for adjusting to different datasets and architectures. We tested it on fully connected, convolutional, and transformer architectures to demonstrate its effectiveness.
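Since the abstract does not give Zorro's closed form, the sketch below is only an illustrative example of the general idea it describes: a continuously differentiable gate that fuses ReLU-like behavior with a Sigmoid. The function name `zorro_like` and the sharpness parameter `a` are hypothetical choices for illustration, not the paper's definition.

```python
import math

def zorro_like(x: float, a: float = 5.0) -> float:
    """Hypothetical smooth ReLU/Sigmoid fusion (NOT the paper's Zorro).

    A Swish-style construction: the sigmoid acts as a soft gate on the
    identity, so the result is continuously differentiable everywhere,
    near zero for large negative inputs, and close to ReLU on [0, 1].
    The parameter `a` controls how sharply the gate transitions.
    """
    gate = 1.0 / (1.0 + math.exp(-a * x))  # sigmoid gate in (0, 1)
    return x * gate
```

Larger values of `a` make the gate steeper, bringing the function closer to the hard ReLU corner at the cost of smoothness; this mirrors the abstract's point that such parameters let the shape adapt to a given dataset and architecture.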