Instructing the model to generate a sequence of intermediate steps, i.e., a chain of thought (CoT), is a highly effective method for improving the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT gives the model the ability to perform inherently serial computation, which transformers otherwise lack, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision and $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. In contrast, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by Boolean circuits of size $T$. Empirically, enabling CoT dramatically improves accuracy on tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
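To make the intuition behind "$T$ steps of CoT for a circuit of size $T$" concrete, the following is a minimal Python sketch, not the paper's construction. It evaluates a Boolean circuit one gate per step, appending each gate's value to a running sequence that later steps can read, much as a CoT token becomes visible to subsequent decoding steps. The gate encoding, the function name `eval_circuit_stepwise`, and the XOR example are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

# Hypothetical encoding: a gate is (op, i, j) with op in {"AND", "OR", "NOT"}.
# i and j index earlier values (inputs first, then previously computed gates).
# For "NOT", only i is used and j is ignored.
Gate = Tuple[str, int, int]


def eval_circuit_stepwise(inputs: List[bool], gates: List[Gate]) -> List[bool]:
    """Evaluate gates in topological order; return one value per step."""
    values = list(inputs)  # positions 0..n-1 hold the input bits
    trace = []
    for op, i, j in gates:
        if op == "AND":
            v = values[i] and values[j]
        elif op == "OR":
            v = values[i] or values[j]
        elif op == "NOT":
            v = not values[i]
        else:
            raise ValueError(f"unknown gate type: {op}")
        values.append(v)  # the new value is visible to all later steps
        trace.append(v)
    return trace


# XOR of two bits from AND/OR/NOT: (a OR b) AND NOT (a AND b).
xor_gates = [
    ("OR", 0, 1),   # stored at index 2
    ("AND", 0, 1),  # stored at index 3
    ("NOT", 3, 3),  # stored at index 4
    ("AND", 2, 4),  # stored at index 5, the circuit output
]

trace = eval_circuit_stepwise([True, False], xor_gates)
print(trace[-1])  # final step holds the circuit output
```

Each loop iteration does only a constant amount of work (look up at most two earlier values and apply one gate), which is the sense in which a shallow model could plausibly handle a single step. The serial depth comes from the $T$ iterations themselves.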