The transformer architecture is almost a decade old, yet we still have only a limited understanding of what it can and cannot compute. For instance, can a one-layer transformer solve PARITY, and more generally, which kinds of transformers can? Known constructions for PARITY use at least two layers and rely on impractical features: a length-dependent positional encoding, hardmax, layernorm without the regularization parameter, or incompatibility with causal masking. We give a new transformer construction for PARITY with softmax attention, a length-independent and polynomially bounded positional encoding, and no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY, by showing that it cannot be solved with only one layer and one head.
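For reference, PARITY asks whether a binary input string contains an odd number of 1s. A minimal sketch of the target function (the function name and the brute-force check are ours, for illustration only, not part of the paper's construction):

```python
def parity(bits):
    """Return 1 if the number of 1s in `bits` is odd, else 0."""
    return sum(bits) % 2

# Brute-force sanity check over all binary strings of length 1..4.
for n in range(1, 5):
    for x in range(2 ** n):
        bits = [(x >> i) & 1 for i in range(n)]
        assert parity(bits) == bin(x).count("1") % 2
```

The difficulty for transformers is that every input position influences the answer, so the output must flip whenever any single bit flips.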