Analyzing Transformer Dynamics as Movement through Embedding Space

From arXiv, v2: Rewrote the abstract. Rewrote and reorganized the entire paper into a more formal proposition/argument/result format. To shorten the main paper: wrote more compact text in general, moved the "negative self bias" and "encoder vs. decoder walks" sections to the appendix, and packed figures. Styled as TMLR.

Transformer-based language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting, and using tools. This paper explores how their underlying mechanics give rise to these behaviors. To that end, we propose framing Transformer dynamics as movement through embedding space. Examining Transformers through this perspective reveals key insights, establishing a Theory of Transformers: 1) Intelligent behaviors map to paths in Embedding Space, which the Transformer random-walks through during inference. 2) LM training learns a probability distribution over all possible paths. "Intelligence" is learnt by assigning higher probabilities to paths representing intelligent behaviors. No learning can take place in-context; context only narrows the subset of paths sampled during decoding. 5) The Transformer is a self-mapping composition function, folding a context sequence into a context-vector such that its proximity to a token-vector reflects their co-occurrence and conditional probability. Thus, the physical arrangement of vectors in Embedding Space determines path probabilities. 6) Context vectors are composed by aggregating features of the sequence's tokens via a process we call the encoding walk. Attention contributes a (potentially redundant) association bias to this process. 7) This process comprises two principal operation types: filtering (data independent) and aggregation (data dependent). This generalization unifies Transformers with other sequence models. Building upon this foundation, we formalize a popular semantic interpretation of embeddings into a "concept-space theory" and find some evidence of its validity.
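The random-walk picture in the abstract can be made concrete with a toy sketch. The code below is an illustration under stated assumptions, not the paper's method: the 2-D embedding table `EMBEDDINGS`, the averaging `compose_context` function (a stand-in for the Transformer's composition of a context vector), and the dot-product-plus-softmax scoring are all hypothetical choices made for readability. It shows only the general idea that a context vector's proximity to each token vector sets the next-token probability, and that repeated sampling traces a path through embedding space.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_context(token_ids, embeddings):
    """Toy stand-in for the Transformer's composition function.

    Averages the token vectors into one context vector. This is an
    assumption for illustration, not the paper's encoding walk.
    """
    dim = len(embeddings[0])
    n = len(token_ids)
    return [sum(embeddings[t][d] for t in token_ids) / n for d in range(dim)]


def next_token_probs(context_vec, embeddings):
    """Turn proximity (dot product) to the context vector into probabilities."""
    scores = [sum(c * e for c, e in zip(context_vec, emb)) for emb in embeddings]
    return softmax(scores)


def random_walk(prompt_ids, embeddings, steps, rng):
    """Sample a path: compose the context, then sample the next token."""
    path = list(prompt_ids)
    for _ in range(steps):
        probs = next_token_probs(compose_context(path, embeddings), embeddings)
        path.append(rng.choices(range(len(embeddings)), weights=probs, k=1)[0])
    return path


# Hypothetical 4-token vocabulary embedded in 2-D.
EMBEDDINGS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

rng = random.Random(0)  # seeded so the walk is reproducible
path = random_walk([0], EMBEDDINGS, steps=5, rng=rng)
probs = next_token_probs(compose_context([0], EMBEDDINGS), EMBEDDINGS)
```

In this sketch, tokens whose vectors lie closer to the context vector receive higher probability, so changing the prompt changes which paths are likely without changing `EMBEDDINGS`. That mirrors, in miniature, the abstract's claim that context narrows the paths sampled rather than altering what was learned.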
