We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we significantly simplify the verification of this condition, yielding a non-constructive approach to establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We present examples illustrating these insights.