Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using either standard softmax attention or Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relation rules, and that the choice of normalization determines how these relations are represented. Moreover, they provide a principled explanation for why Transformers are effective at modeling contextual relations.
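To make the two normalizations concrete, the following is a minimal sketch, not the construction used in this work. It applies both to the same affinity matrix: row-wise softmax yields one conditional distribution per query, while Sinkhorn iterations rescale rows and columns alternately until the matrix is approximately a coupling with prescribed marginals. The score values, the uniform marginals, the iteration count, and the function names `softmax_rows` and `sinkhorn` are all illustrative assumptions.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: each row becomes a conditional distribution over keys."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sinkhorn(scores, n_iters=200):
    """Alternately rescale rows and columns of exp(scores).

    Assumption: uniform target marginals (row sums 1/n, column sums 1/m).
    Returns a matrix that approximates a coupling with those marginals.
    """
    n, m = len(scores), len(scores[0])
    top = max(max(row) for row in scores)
    # Positive kernel built from the affinity scores.
    K = [[math.exp(s - top) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row step: make each row of diag(u) K diag(v) sum to 1/n.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column step: make each column sum to 1/m.
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical affinity scores, e.g. scaled query-key dot products.
scores = [[2.0, 0.5, -1.0],
          [0.1, 1.5, 0.3]]

attn = softmax_rows(scores)  # conditional view: every row sums to 1
plan = sinkhorn(scores)      # coupling view: all entries sum to 1
```

In this sketch `attn` constrains only the rows, so nothing prevents several queries from concentrating on the same key, whereas `plan` also fixes how much total mass each key receives. This is one way to read the statement that the choice of normalization determines how the relations are represented.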