Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them has succeeded in replacing LN so far. The most recent proposal in this line of research is a dynamic activation function called Dynamic Tanh (DyT). Although it is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between LN and dynamic activation functions. In particular, we derive DyT from the LN variant RMSNorm and show that doing so requires both a well-defined decoupling in derivative space and an approximation. By applying the same decoupling procedure directly in function space, we omit the approximation and obtain the exact element-wise counterpart of RMSNorm, which we call the Dynamic Inverse Square Root Unit (DyISRU). We demonstrate numerically that DyISRU reproduces the normalization effect on outliers more accurately than DyT does.
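For orientation, the functional forms involved can be sketched as follows. RMSNorm and DyT are standard definitions; the DyISRU expression shown here is only an assumed element-wise dynamic variant of the inverse square root unit $\mathrm{ISRU}(x) = x/\sqrt{1 + \alpha x^2}$, and the exact parameterization derived in this work may differ:
\[
\mathrm{RMSNorm}(\mathbf{x})_i \;=\; \gamma_i \, \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}},
\qquad
\mathrm{DyT}(x) \;=\; \gamma \tanh(\alpha x) + \beta,
\qquad
\mathrm{DyISRU}(x) \;\approx\; \gamma \, \frac{x}{\sqrt{\alpha + x^{2}}} + \beta .
\]
Whereas RMSNorm couples all $d$ entries of a token through the shared root-mean-square in its denominator, DyT and DyISRU act on each entry independently, which is what makes an exact element-wise counterpart nontrivial to obtain.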