The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.
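To make the geometric picture concrete, the following is a minimal sketch of a geodesic (exponential-map) step on the unit hypersphere, with a layer-wise decaying step size standing in for the update-decay schedule mentioned above. The function name, the 1/(l+1) schedule, and the numpy setup are illustrative assumptions, not the paper's exact GeoNorm formulation.

```python
import numpy as np

def geodesic_update(x, v, step_size):
    """Move a unit-norm hidden state x along the sphere geodesic whose
    initial direction is the tangent-space projection of the update v.

    Sketch only: a generic exponential-map step on the unit hypersphere,
    with `step_size` playing the role of a layer-wise decayed update size.
    """
    # Project the raw update (e.g. an attention or FFN output) onto the
    # tangent space of the sphere at x.
    v_tan = v - np.dot(x, v) * x
    norm = np.linalg.norm(v_tan)
    if norm < 1e-12:  # degenerate update: nothing to move along
        return x
    angle = step_size * norm
    # Exponential map on the unit sphere: rotate x toward v_tan by `angle`.
    return np.cos(angle) * x + np.sin(angle) * (v_tan / norm)


# Toy usage: a stack of "layers" with a hypothetical 1/(l+1) decay schedule.
d = 16
rng = np.random.default_rng(0)
x = rng.normal(size=d)
x /= np.linalg.norm(x)              # start on the unit sphere
for layer in range(6):
    v = rng.normal(size=d)          # stand-in for an attention/FFN output
    x = geodesic_update(x, v, step_size=1.0 / (layer + 1))
    print(layer, np.linalg.norm(x))  # stays ~1.0: updates remain on the sphere
```

Because the step is taken along the sphere itself, the hidden state keeps unit norm by construction, which is the sense in which a geodesic update can play the role of an explicit normalization layer.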