We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as dot-product self-attention, positional encoding and the feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.