We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, paired with output sequences of equal or shorter length. Specifically, given $N$ input sequences of arbitrary but finite lengths in $\mathbb{R}^d$, with corresponding output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. By alternating feed-forward and self-attention layers and capitalizing on the clustering effect inherent to the latter, our construction yields complexity estimates that are independent of the input sequence lengths. Moreover, our constructive method uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models and offers an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.
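For concreteness, one standard way to write the hardmax self-attention update referenced above is the following minimal sketch; the notation ($z_i$, $W$, $A$) and the handling of ties are assumptions for illustration and may differ from the precise layer used in our construction:
\[
  z_i \;\longmapsto\; z_i + W z_{i^\ast},
  \qquad
  i^\ast \in \operatorname*{arg\,max}_{1 \le j \le n} \langle A z_i, z_j \rangle,
\]
where $(z_1,\dots,z_n)$ denotes the token sequence in $\mathbb{R}^d$, $W$ plays the role of the value matrix, and $A$ is a (possibly low-rank) query-key matrix. In this hardmax limit of softmax attention, each token attends only to the token maximizing its attention score, which is the source of the clustering effect exploited in the construction.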