Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. Combined with known results in complexity theory, this provides insight into the power of transformers. For example, if $\mathsf L \neq \mathsf P$ (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.