The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data points. Transformers have driven recent advances in natural language processing, computer vision, and spatio-temporal modelling. There are many introductions to transformers, but most lack precise mathematical descriptions of the architecture, and the intuitions behind the design choices are often missing as well. Moreover, as research takes a winding path, the explanations for the components of the transformer can be idiosyncratic. In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture. We will not discuss training, as this is rather standard. We assume that the reader is familiar with fundamental topics in machine learning, including multi-layer perceptrons, linear transformations, softmax functions, and basic probability.