The transformer architecture has prevailed across deep learning settings because of its exceptional ability to select and compose structural information. Motivated by this ability, Sanford et al. proposed the sparse token selection task, on which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building on that result, we strengthen the FCN lower bound to an average-case setting and establish an algorithmic separation between transformers and FCNs. Specifically, a one-layer transformer trained with gradient descent provably learns the sparse token selection task and, surprisingly, exhibits strong out-of-distribution length generalization. We provide empirical simulations that support our theoretical findings.
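The abstract does not spell out the task, so the sketch below is only an illustration under an assumed form: given T tokens and a subset of q indices, the target is the average of the selected tokens. The names `sample_instance` and `attention_select`, the Gaussian token distribution, and the hand-set attention scores with scale `beta` are all hypothetical; this is neither the trained model nor the construction analyzed in the paper. It shows only why a single softmax attention step suits this kind of selection, and why the same scores keep working as the sequence length grows.

```python
import numpy as np

def sample_instance(T, d, q, rng):
    """Draw T random d-dimensional tokens and a size-q index subset.

    Assumed task form: the target is the mean of the selected tokens.
    """
    X = rng.standard_normal((T, d))
    subset = rng.choice(T, size=q, replace=False)
    target = X[subset].mean(axis=0)
    return X, subset, target

def attention_select(X, subset, beta=30.0):
    """One softmax attention step with hand-set (not learned) scores.

    Selected positions get logit beta, all others get 0, so for large
    beta the weights approach a uniform distribution over the subset.
    """
    T = X.shape[0]
    logits = np.zeros(T)
    logits[subset] = beta
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ X

rng = np.random.default_rng(0)
errors = {}
for T in (16, 64, 256):  # same scores, increasing sequence length
    X, subset, target = sample_instance(T, d=8, q=3, rng=rng)
    errors[T] = float(np.abs(attention_select(X, subset) - target).max())
```

In this toy setting the attention output matches the target up to a leakage term that shrinks exponentially in `beta`, at every length tried. Whether gradient descent actually finds such scores is the question the paper addresses, and this sketch does not answer it.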