We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse, and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as dot-product self-attention, positional encoding, and the feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.