The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work, we propose a family of $C^{2N}$-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic. We introduce three variants: GEM (the base family), E-GEM (an $ε$-parameterized generalization enabling arbitrary $L^p$-approximation of ReLU), and SE-GEM (a piecewise variant that eliminates dead neurons while retaining $C^{2N}$ smoothness at the junction). An ablation over $N$ establishes $N=1$ as optimal for standard-depth networks, reducing the accuracy deficit relative to GELU on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The smoothness parameter $N$ further reveals a CNN-transformer tradeoff: $N=1$ is preferred for deep CNNs, while $N=2$ is preferred for transformers. On MNIST, E-GEM ties the best baseline (99.23%). On CIFAR-10 + ResNet-56, SE-GEM ($ε=10^{-4}$) surpasses GELU (92.51% vs 92.44%), making it the first GEM-family activation to outperform GELU. On CIFAR-100 + ResNet-56, E-GEM reduces the deficit relative to GELU from 6.10% (GEM $N=2$) to just 0.62%. On GPT-2 (124M), GEM achieves the lowest perplexity (72.57 vs 73.76 for GELU), and GEM $N=1$ (73.32) also beats GELU. On BERT-small, E-GEM ($ε=10$) achieves the best validation loss (6.656) across all activations. The $ε$-parameterization reveals a scale-dependent optimum: small $ε$ ($10^{-4}$ to $10^{-6}$) works best for deep CNNs and larger transformers, whereas small transformers (BERT-small) benefit from large $ε$ ($ε=10$), which we attribute to their limited depth and unconstrained gradients.
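The abstract does not state the GEM formula, so the following is only a rough sketch of what a log-logistic gate could look like, not the authors' definition. It relies on the standard log-logistic CDF, $F(x;\alpha,\beta) = \frac{(x/\alpha)^{\beta}}{1+(x/\alpha)^{\beta}}$ for $x>0$ and $0$ otherwise, and on two assumptions made purely for illustration: unit scale ($\alpha=1$) and shape $\beta=2N$. The function names `log_logistic_cdf` and `gated_activation` are hypothetical. Under these assumptions the activation is $x^{2N+1}/(1+x^{2N})$ for $x>0$, which uses only rational arithmetic and behaves like $x^{2N+1}$ near zero, so its first $2N$ derivatives vanish there; this is at least consistent with the claimed $C^{2N}$ smoothness. E-GEM and SE-GEM are not modeled.

```python
def log_logistic_cdf(x: float, shape: float, scale: float = 1.0) -> float:
    """Standard log-logistic CDF: zero for x <= 0, r / (1 + r) otherwise."""
    if x <= 0.0:
        return 0.0
    r = (x / scale) ** shape
    return r / (1.0 + r)


def gated_activation(x: float, n: int = 1) -> float:
    """Hypothetical gate-times-input activation: x * F(x; shape=2n, scale=1).

    ASSUMPTION: shape = 2n and unit scale are illustrative choices; the
    abstract does not give the actual GEM parameterization.
    For x > 0 this equals x**(2n + 1) / (1 + x**(2n)); for x <= 0 it is 0.
    """
    return x * log_logistic_cdf(x, shape=2 * n)


def relu(x: float) -> float:
    """Reference ReLU for comparison."""
    return max(x, 0.0)


if __name__ == "__main__":
    # Compare the sketch against ReLU at a few sample points.
    for x in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 10.0):
        print(f"x={x:6.2f}  relu={relu(x):8.4f}  "
              f"n=1: {gated_activation(x, 1):8.4f}  "
              f"n=2: {gated_activation(x, 2):8.4f}")
```

In this sketch the output is exactly zero for non-positive inputs and approaches ReLU for large positive inputs, with the gap shrinking as $x$ grows. This plain-float version makes no attempt at numerical robustness for extreme inputs.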