We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.