Transformers with greater depth (that is, more layers) have been observed to have greater capabilities, but can we establish formally which capabilities are gained with depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP, and that this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower ones, implying that deeper transformers are more expressive than shallower transformers (within the subclass just mentioned). We also prove the same for transformers with positional encodings such as RoPE and ALiBi. These results are established by studying a temporal logic with counting operators that is equivalent to C-RASP. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
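To make the counting-based formalism concrete, here is a purely illustrative sketch, not taken from the paper: a C-RASP-style computation, built only from position-wise predicates, prefix counts, and comparisons, that recognizes the example language a^n b^n. The helper names and the choice of example language are ours.

```python
# Illustrative sketch only (not the paper's construction): a C-RASP-style
# program recognizing a^n b^n using prefix counts and comparisons.

def prefix_count(bits):
    """Number of True entries at positions j <= i, for every position i."""
    total, out = 0, []
    for b in bits:
        total += int(b)
        out.append(total)
    return out

def accepts_anbn(w: str) -> bool:
    is_a = [c == "a" for c in w]     # position-wise predicate Q_a
    is_b = [c == "b" for c in w]     # position-wise predicate Q_b
    count_a = prefix_count(is_a)     # counting operator applied to Q_a
    count_b = prefix_count(is_b)     # counting operator applied to Q_b
    if not w:
        return True
    # No 'a' may appear after a 'b', and b's never outnumber a's in any prefix.
    ok_everywhere = all(
        (not is_a[i] or count_b[i] == 0) and count_a[i] >= count_b[i]
        for i in range(len(w))
    )
    # Equal counts of a's and b's at the final position.
    return ok_everywhere and count_a[-1] == count_b[-1]

assert accepts_anbn("aaabbb")
assert not accepts_anbn("aabbb")
assert not accepts_anbn("abab")
```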