Instructing the model to generate a sequence of intermediate steps, also known as a chain of thought (CoT), is a highly effective method for improving the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT gives the model the ability to perform inherently serial computation, which transformers otherwise lack, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision and $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by Boolean circuits of size $T$. Empirically, enabling CoT dramatically improves accuracy on tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
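To make the "serial computation" intuition concrete, the following is a minimal Python sketch of a circuit value problem evaluated one gate at a time, so that a circuit of size $T$ produces $T$ intermediate values, loosely analogous to $T$ CoT steps. It is an illustration only and not the construction used in the paper: the gate encoding, the function name `evaluate_with_trace`, and the example circuit `xor_gates` are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical gate encoding (an assumption for this sketch):
# (op, i, j) with op in {"AND", "OR", "NOT"}; i and j index earlier values
# (circuit inputs first, then previously computed gates). "NOT" ignores j.
Gate = Tuple[str, int, int]


def evaluate_with_trace(inputs: List[bool], gates: List[Gate]) -> List[bool]:
    """Evaluate gates in order and return one intermediate value per gate."""
    values = list(inputs)
    trace = []
    for op, i, j in gates:
        if op == "AND":
            out = values[i] and values[j]
        elif op == "OR":
            out = values[i] or values[j]
        elif op == "NOT":
            out = not values[i]
        else:
            raise ValueError(f"unknown gate: {op}")
        values.append(out)  # later gates may read this value
        trace.append(out)   # one emitted "step" per gate
    return trace


# XOR(x0, x1) = (x0 OR x1) AND NOT (x0 AND x1), a circuit of size T = 4.
xor_gates = [("OR", 0, 1), ("AND", 0, 1), ("NOT", 3, 3), ("AND", 2, 4)]
trace = evaluate_with_trace([True, False], xor_gates)
print(trace)      # the four intermediate gate values
print(trace[-1])  # value of the output gate
```

Each loop iteration depends on values written by earlier iterations, which is the kind of step-by-step dependency that emitting intermediate tokens lets a fixed-depth model carry forward.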