We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse, and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as dot-product self-attention, positional encoding, and the feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights naturally suggest alternative architectures.