Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite this success, the mathematical implications of gated attention remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher--Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth amplification effect.
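
The flatness claim for the ungated case can be illustrated with a short worked derivation. This is a minimal sketch under assumptions that are not stated in the text above: the Gaussian family has a fixed isotropic covariance \sigma^2 I, the symbol \theta stands for whichever coordinates the output mean depends on, and A, b, C, d and the gate nonlinearity s are hypothetical names introduced only for illustration. The precise parameterization used in the paper may differ.

```latex
% Assumed model: p(y | \theta) = N(\mu(\theta), \sigma^2 I) with fixed \sigma.
% For this family the Fisher information metric is the pullback of the
% Euclidean metric through the mean map \mu, with Jacobian J = \partial\mu / \partial\theta.
\begin{align*}
g_{ij}(\theta)
  &= \frac{1}{\sigma^2}
     \left\langle \partial_i \mu(\theta),\, \partial_j \mu(\theta) \right\rangle,
  &
g(\theta) &= \frac{1}{\sigma^2}\, J(\theta)^{\top} J(\theta).
\\[4pt]
% Ungated (affine) mean map: constant Jacobian, hence constant metric.
\mu(\theta) &= A\theta + b
  &\Longrightarrow\quad
g(\theta) &= \frac{1}{\sigma^2}\, A^{\top} A
  \quad\text{(independent of } \theta\text{)}.
\\[4pt]
% A constant metric has vanishing Christoffel symbols and zero Riemann curvature.
\Gamma^{k}_{ij} &= 0,
  &
R^{l}{}_{ijk} &= 0.
\\[4pt]
% Gated mean map (hypothetical elementwise gate s): the Jacobian depends on \theta.
\mu(\theta) &= s(C\theta + d) \odot (A\theta + b),
  &
J(\theta) &= \operatorname{diag}\!\big(s(C\theta + d)\big)\, A
  + \operatorname{diag}\!\big((A\theta + b) \odot s'(C\theta + d)\big)\, C.
\end{align*}
```

In the gated case the metric J(\theta)^{\top} J(\theta) / \sigma^2 varies with \theta, so its curvature is no longer forced to vanish. A varying metric alone does not guarantee nonzero curvature; the sketch only shows why the affine argument for flatness stops applying once a multiplicative gate is present.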