The Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to the Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;λ)=xΦ(λx), where Φ is the standard Gaussian CDF and λ ∈ [1, ∞) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model-dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
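To make the formulation f(x;λ)=xΦ(λx) concrete, the following is a minimal scalar sketch in plain Python. The function names (`gaussian_cdf`, `lambda_gelu`, `max_gap_to_relu`) and the evaluation grid are illustrative assumptions, not taken from the paper, and the sketch covers only the forward activation: the constrained reparameterization and the optimizer-aware update for λ are not specified in the abstract and are not reproduced here.

```python
import math


def gaussian_cdf(z: float) -> float:
    """Standard Gaussian CDF, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lambda_gelu(x: float, lam: float = 1.0) -> float:
    """Hardness-parameterized GELU: f(x; lam) = x * Phi(lam * x).

    lam = 1 recovers the standard GELU; larger lam sharpens the gate.
    """
    if lam < 1.0:
        # The abstract restricts lam to [1, infinity).
        raise ValueError("lam must be >= 1")
    return x * gaussian_cdf(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(lam: float, lo: float = -4.0, hi: float = 4.0,
                    steps: int = 801) -> float:
    """Largest |f(x; lam) - ReLU(x)| on a uniform grid (grid is an assumption)."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in xs)


if __name__ == "__main__":
    for lam in (1.0, 2.0, 4.0, 10.0):
        print(f"lam={lam:5.1f}  max gap to ReLU = {max_gap_to_relu(lam):.4f}")
```

Substituting u = λx shows that the worst-case gap |xΦ(λx) − ReLU(x)| equals (1/λ)·max_u uΦ(−u), so it shrinks in proportion to 1/λ. This is one way to see why hardening the gates before swapping in ReLU should reduce the disruption caused by the substitution.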