Theoretical efforts to prove advantages of Transformers over classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative perspective and prove that even with infinite compute, feedforward and recurrent networks may suffer from larger sample complexity than Transformers, as the latter can adapt to a form of dynamic sparsity. Specifically, we consider a sequence-to-sequence data generating model on sequences of length $N$, in which the output at each position depends only on $q$ relevant tokens with $q \ll N$, and the positions of these tokens are described in the input prompt. We prove that a single-layer Transformer can learn this model if and only if its number of attention heads is at least $q$, in which case it achieves a sample complexity almost independent of $N$, while recurrent networks require $N^{\Omega(1)}$ samples on the same problem. Under a simplified version of this model, recurrent networks may achieve a complexity almost independent of $N$, while feedforward networks still require $N$ samples. Consequently, our proposed sparse retrieval model illustrates a natural hierarchy in sample complexity across these architectures.
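The data generating model described above can be sketched as a synthetic task. This is a minimal illustration, not the paper's exact construction: the link function (here, a sum of the retrieved tokens) and the encoding of positions in the prompt are assumptions for concreteness; only the structure (each output depends on $q$ of $N$ tokens, with the $q$ indices given in the input) comes from the text.

```python
import numpy as np

def sample_sparse_retrieval(N, q, d, rng):
    """Draw one example from a hypothetical q-sparse retrieval task.

    The label at position i depends only on q tokens whose indices
    are part of the input; the sum link function is an assumption.
    """
    tokens = rng.standard_normal((N, d))  # content tokens, one per position
    # for each output position, q distinct relevant positions (given in the prompt)
    idx = np.stack([rng.choice(N, size=q, replace=False) for _ in range(N)])
    # assumed link function: label = sum of the q retrieved tokens
    labels = tokens[idx].sum(axis=1)      # shape (N, d)
    return tokens, idx, labels

rng = np.random.default_rng(0)
X, P, Y = sample_sparse_retrieval(N=16, q=3, d=4, rng=rng)
```

A learner sees `(X, P)` as input and must predict `Y`; with $q \ll N$, each output is a sparse function of the sequence, and which positions matter varies per example, which is the "dynamic sparsity" a multi-head attention layer can exploit.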