It has been observed that transformers with greater depth (that is, more layers) have greater capabilities, but can we formally establish which capabilities are gained with depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP, and that this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). We prove the same for transformers with positional encodings (such as RoPE and ALiBi). These results are established by studying a temporal logic with counting operators that is equivalent to C-RASP. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.