Although transformer neural nets are omnipresent in modern NLP, characterizing their computational power remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. Combined with known results in complexity theory, this provides insight into the power of transformers. For example, if $\mathsf L \neq \mathsf P$ (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Intuitively, our result stems from the high parallelizability of the transformer architecture. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will be subject to similar limitations. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
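The conditional claim above can be unpacked as a chain of inclusions. The following is only a sketch: $\mathcal{T}_{\log}$ is notation introduced here (it does not appear in the original) for the class of problems solvable by log-precision transformers. The first inclusion is the simulation result stated above, and the remaining two are standard facts from complexity theory.

```latex
\[
  \mathcal{T}_{\log}
  \;\subseteq\; \text{logspace-uniform } \mathsf{TC}^0
  \;\subseteq\; \mathsf{L}
  \;\subseteq\; \mathsf{P}
\]
% If L != P, then no problem that is P-complete under logspace reductions
% lies in L, and hence none lies in T_log.
```

Under this reading, a problem that is $\mathsf P$-complete under logspace reductions falls outside $\mathcal{T}_{\log}$ whenever $\mathsf L \neq \mathsf P$, which is the role the two example problems play in the statement above.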