Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.
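The step from hardmax to softmax described above rests on attention scores being well separated. The following is a minimal illustrative sketch of that idea, not the paper's construction: when the top score leads all others by a fixed gap, a softmax whose scale grows only logarithmically in the context length yields exactly the hardmax weights once the weights are rounded. The fixed-point rounding scheme and the names `round_fixed`, `rounded_softmax`, and `sufficient_scale` are assumptions made for illustration.

```python
import math


def softmax(scores, scale=1.0):
    """Numerically stable softmax over a list of scores, with a score multiplier."""
    m = max(scores)
    exps = [math.exp(scale * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hardmax(scores):
    """Uniform weight on the maximal score(s), zero elsewhere."""
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    count = sum(winners)
    return [w / count for w in winners]


def round_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (assumed rounding scheme)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step


def rounded_softmax(scores, scale, frac_bits):
    """Softmax attention weights after rounding each weight to frac_bits bits."""
    return [round_fixed(w, frac_bits) for w in softmax(scores, scale)]


def sufficient_scale(n, gap, frac_bits):
    """A scale large enough that rounded softmax equals hardmax.

    Assumes a unique top score that exceeds every other score by at least `gap`.
    The total non-maximal mass is at most (n - 1) * exp(-scale * gap), which
    must stay below half a rounding step, 2**-(frac_bits + 1). Note that the
    resulting scale grows like log(n), not like n.
    """
    return math.log(n * 2 ** (frac_bits + 1)) / gap


if __name__ == "__main__":
    n, gap, frac_bits = 64, 1.0, 8
    scores = [0.0] * (n - 1) + [gap]  # one position leads by exactly `gap`
    scale = sufficient_scale(n, gap, frac_bits)
    print("scale:", scale)
    print("matches hardmax:", rounded_softmax(scores, scale, frac_bits) == hardmax(scores))
```

In this sketch, halving the gap doubles the required scale, while doubling the context length only adds a constant to it. That is the sense in which well-separated scores avoid very large parameter magnitudes.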