On the Expressivity Role of LayerNorm in Transformers' Attention - 专知论文

会员服务 ·

0

Attention · 向量化 · 层 · 规范化的 · Projection ·

2023 年 5 月 11 日

On the Expressivity Role of LayerNorm in Transformers' Attention

翻译：论LayerNorm在Transformer注意力机制中的表达性作用

Shaked Brody,Uri Alon,Eran Yahav

from arxiv, Accepted as a short paper in Findings of ACL 2023

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors to a $d-1$ space that is orthogonal to the $\left[1,1,...,1\right]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, offloading the need to learn this operation by the attention; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from being "un-select-able". We show empirically that Transformers do indeed benefit from these properties of LayeNorm in general language modeling and even in computing simple functions such as "majority". Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role .

翻译：层归一化（Layer Normalization，简称LayerNorm）是所有基于Transformer的模型中固有的组成部分。本文表明，LayerNorm对其后的多头注意力层的表达性至关重要。这不同于普遍认为LayerNorm仅在前向传播中归一化激活值、在反向传播中归一化梯度的观点。我们考虑LayerNorm的几何解释，并表明它包含两个组成部分：（a）将输入向量投影到与$\left[1,1,...,1\right]$向量正交的$d-1$维空间，以及（b）将所有向量缩放至相同的$\sqrt{d}$范数。我们证明这两个组成部分对Transformer中其后的注意力层均具有重要意义：（a）投影使注意力机制能够创建同等关注所有键的注意力查询，从而省去注意力层学习此操作的需求；（b）缩放使每个键可能获得最高注意力，并防止键变得“不可选择”。我们通过实验证明，Transformer在通用语言建模乃至“多数投票”等简单函数计算中，确实受益于LayerNorm的这些特性。我们的代码可在https://github.com/tech-srl/layer_norm_expressivity_role获取。

0

相关内容

Attention

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

326+阅读 · 2020年11月26日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【ICLR2020论文】自我注意力与卷积层的关系，On the Relationship between Self-Attention and Convolutional Layers

【ICLR2020论文】自我注意力与卷积层的关系，On the Relationship between Self-Attention and Convolutional Layers

专知会员服务

37+阅读 · 2020年1月12日

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

专知会员服务

15+阅读 · 2019年10月23日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

ExBert — 可视化分析Transformer学到的表示

ExBert — 可视化分析Transformer学到的表示

专知会员服务

32+阅读 · 2019年10月16日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

84+阅读 · 2019年10月9日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

广义欧拉多项式的实根性

国家自然科学基金

0+阅读 · 2015年12月31日

embAC与embB共突变对结核分枝杆菌乙胺丁醇药物敏感性的调节机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

小样本空间制图

国家自然科学基金

0+阅读 · 2012年12月31日

高k材料MOSFET沟道电子迁移率的增强研究

国家自然科学基金

0+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

Clusterin通过线粒体凋亡通路调节肝细胞肝癌化疗耐受机理的研究

国家自然科学基金

0+阅读 · 2011年12月31日

三江平原湿地小叶章群落结构与功能对氮沉降和CO2升高的响应

国家自然科学基金

0+阅读 · 2011年12月31日

金属有机聚合物骨架有序孔材料的可控制备与储氢性能研究

国家自然科学基金

0+阅读 · 2011年12月31日

可压Navier-Stokes方程及相关流体动力学方程研究

国家自然科学基金

0+阅读 · 2008年12月31日

胶体微粒的结构和表面性质对胞吞和细胞功能的调控

国家自然科学基金

0+阅读 · 2008年12月31日

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Arxiv

0+阅读 · 2023年6月26日

Collaborative and Distributed Bayesian Optimization via Consensus: Showcasing the Power of Collaboration for Optimal Design

Arxiv

0+阅读 · 2023年6月25日

Knowledge-Infused Self Attention Transformers

Arxiv

0+阅读 · 2023年6月23日

Sharp analysis of EM for learning mixtures of pairwise differences

Arxiv

0+阅读 · 2023年6月22日

The quantum cost function concentration dependency on the parametrization expressivity

Arxiv

0+阅读 · 2023年6月22日

A Survey of Deep Causal Model

Arxiv

45+阅读 · 2022年9月19日

A Survey on Vision Transformer

Arxiv

17+阅读 · 2022年2月23日

Attention Bottlenecks for Multimodal Fusion

Arxiv

31+阅读 · 2021年6月30日

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv

15+阅读 · 2018年10月11日

Conditional Random Field and Deep Feature Learning for Hyperspectral Image Segmentation

Arxiv

11+阅读 · 2017年12月27日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

1+阅读 · 55分钟前

多模态代码智能综述：从视觉输入到可执行代码系统

多模态代码智能综述：从视觉输入到可执行代码系统

专知会员服务

1+阅读 · 57分钟前

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

专知会员服务

3+阅读 · 今天8:18

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

专知会员服务

3+阅读 · 今天7:39

《通用大语言模型：无人机指挥与控制接口》最新40页

《通用大语言模型：无人机指挥与控制接口》最新40页

专知会员服务

9+阅读 · 今天7:33

《通过小型无人机系统将情报能力“作战化”》

《通过小型无人机系统将情报能力“作战化”》

专知会员服务

3+阅读 · 今天7:28

《神经安全型有人–无人协同：面向认知自适应作战能力的参考架构》

《神经安全型有人–无人协同：面向认知自适应作战能力的参考架构》

专知会员服务

6+阅读 · 今天7:14

《在指挥链中通过多准则决策分析传达指挥官意图：空战实验》

《在指挥链中通过多准则决策分析传达指挥官意图：空战实验》

专知会员服务

18+阅读 · 6月15日

消耗优势：美军的“精确规模化”概念

消耗优势：美军的“精确规模化”概念

专知会员服务

7+阅读 · 6月15日

五角大楼的AI优先战略及其对现代战争的启示：来自与伊朗冲突的经验教训

五角大楼的AI优先战略及其对现代战争的启示：来自与伊朗冲突的经验教训

专知会员服务

8+阅读 · 6月15日

《网络空间兵棋推演：挑战、局限性与混合路径》报告

《网络空间兵棋推演：挑战、局限性与混合路径》报告

专知会员服务

8+阅读 · 6月15日

《离线语言支持系统：面向空战战术决策》

《离线语言支持系统：面向空战战术决策》

专知会员服务

8+阅读 · 6月15日

《以通信为中心的6G–LLM架构：面向可扩展的战术自主防御车辆网络》

《以通信为中心的6G–LLM架构：面向可扩展的战术自主防御车辆网络》

专知会员服务

7+阅读 · 6月15日

ICML 2026｜ECA：面向开放式图文生成的高效持续对齐

ICML 2026｜ECA：面向开放式图文生成的高效持续对齐

专知会员服务

6+阅读 · 6月14日

可信智能体AI综述：安全、鲁棒性、隐私与系统安全

可信智能体AI综述：安全、鲁棒性、隐私与系统安全

专知会员服务

6+阅读 · 6月14日

相关VIP内容

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

326+阅读 · 2020年11月26日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【ICLR2020论文】自我注意力与卷积层的关系，On the Relationship between Self-Attention and Convolutional Layers

【ICLR2020论文】自我注意力与卷积层的关系，On the Relationship between Self-Attention and Convolutional Layers

专知会员服务

37+阅读 · 2020年1月12日

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

专知会员服务

15+阅读 · 2019年10月23日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

ExBert — 可视化分析Transformer学到的表示

ExBert — 可视化分析Transformer学到的表示

专知会员服务

32+阅读 · 2019年10月16日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

84+阅读 · 2019年10月9日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

多模态代码智能综述：从视觉输入到可执行代码系统

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

相关资讯

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

相关论文

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Arxiv

0+阅读 · 2023年6月26日

Collaborative and Distributed Bayesian Optimization via Consensus: Showcasing the Power of Collaboration for Optimal Design

Arxiv

0+阅读 · 2023年6月25日

Knowledge-Infused Self Attention Transformers

Arxiv

0+阅读 · 2023年6月23日

Sharp analysis of EM for learning mixtures of pairwise differences

Arxiv

0+阅读 · 2023年6月22日

The quantum cost function concentration dependency on the parametrization expressivity

Arxiv

0+阅读 · 2023年6月22日

A Survey of Deep Causal Model

Arxiv

45+阅读 · 2022年9月19日

A Survey on Vision Transformer

Arxiv

17+阅读 · 2022年2月23日

Attention Bottlenecks for Multimodal Fusion

Arxiv

31+阅读 · 2021年6月30日

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv

15+阅读 · 2018年10月11日

Conditional Random Field and Deep Feature Learning for Hyperspectral Image Segmentation

Arxiv

11+阅读 · 2017年12月27日

相关基金

广义欧拉多项式的实根性

国家自然科学基金

0+阅读 · 2015年12月31日

embAC与embB共突变对结核分枝杆菌乙胺丁醇药物敏感性的调节机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

小样本空间制图

国家自然科学基金

0+阅读 · 2012年12月31日

高k材料MOSFET沟道电子迁移率的增强研究

国家自然科学基金

0+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

Clusterin通过线粒体凋亡通路调节肝细胞肝癌化疗耐受机理的研究

国家自然科学基金

0+阅读 · 2011年12月31日

三江平原湿地小叶章群落结构与功能对氮沉降和CO2升高的响应

国家自然科学基金

0+阅读 · 2011年12月31日

金属有机聚合物骨架有序孔材料的可控制备与储氢性能研究

国家自然科学基金

0+阅读 · 2011年12月31日

可压Navier-Stokes方程及相关流体动力学方程研究

国家自然科学基金

0+阅读 · 2008年12月31日

胶体微粒的结构和表面性质对胞吞和细胞功能的调控

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员