Transformers are widely used as a general-purpose substrate for learning complex correlations among large collections of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality, and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly fine scales across its layers. In an in-context regression task, the theory predicts a non-trivial relationship between the non-Gaussian, hierarchical structure of the latent context variable and transformer depth. We test these predictions using constrained linear-attention transformers, and the results demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.
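For orientation, the in-context regression setting can be illustrated with a minimal sketch. It is not the construction developed in this work. It shows only the simple, previously described baseline that the abstract contrasts against: a single softmax-free (linear) attention readout, with no feedforward block, reproduces one gradient-descent step on an in-context linear regression problem. The function names, the Gaussian choice of the latent vector `w_true`, the dimensions, and the step size `eta` are all assumptions made for illustration.

```python
import numpy as np

def linear_attention_predict(X, y, x_query, eta=1.0):
    """Softmax-free attention readout: keys are x_i, values are y_i, the query is x_query."""
    n = X.shape[0]
    scores = X @ x_query            # unnormalized key-query dot products, shape (n,)
    return eta / n * (scores @ y)   # score-weighted sum of the values

def one_gd_step_predict(X, y, x_query, eta=1.0):
    """Prediction after one gradient step from w = 0 on (1/2n) * sum_i (w.x_i - y_i)^2."""
    n = X.shape[0]
    w = eta / n * (X.T @ y)         # w_1 = w_0 - eta * grad, with w_0 = 0
    return w @ x_query

rng = np.random.default_rng(0)
d, n = 4, 32                        # hypothetical input dimension and context length
w_true = rng.normal(size=d)         # latent context variable (Gaussian here, for illustration only)
X = rng.normal(size=(n, d))         # in-context inputs
y = X @ w_true                      # noiseless in-context targets
x_query = rng.normal(size=d)

pred_attn = linear_attention_predict(X, y, x_query)
pred_gd = one_gd_step_predict(X, y, x_query)
print(np.allclose(pred_attn, pred_gd))
```

The two predictions agree because both reduce to the same bilinear form in the context and the query. A single layer of this kind cannot adapt to non-Gaussian or hierarchical structure in the latent variable, which is the regime where the abstract argues that feedforward blocks and depth matter.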