We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse, and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as dot-product self-attention, positional encoding, and the feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.