Instructing the model to generate a sequence of intermediate steps, also known as a chain of thought (CoT), is a highly effective method for improving the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision and $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by Boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
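To make the link between CoT steps and circuit size concrete, the following is a minimal sketch of serial circuit evaluation, not the construction used in this work. It evaluates a Boolean circuit one gate at a time and records each gate's output, loosely analogous to emitting one CoT token per gate, so a circuit of size $T$ yields a trace of length $T$. The gate encoding and the name `eval_circuit_stepwise` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical encoding (an assumption for this sketch): a gate is (op, i, j),
# where op is "AND", "OR", or "NOT", and i, j index previously computed wires.
# Wires 0..len(inputs)-1 are the inputs; each gate appends one new wire.
Gate = Tuple[str, int, int]


def eval_circuit_stepwise(inputs: List[bool], gates: List[Gate]) -> List[bool]:
    """Evaluate gates in order and return the value produced at each step.

    Each loop iteration depends on earlier results, which is the serial
    structure that one step of chain of thought per gate can mirror.
    """
    wires = list(inputs)
    trace = []
    for op, i, j in gates:
        if op == "AND":
            value = wires[i] and wires[j]
        elif op == "OR":
            value = wires[i] or wires[j]
        elif op == "NOT":
            value = not wires[i]  # j is ignored for unary gates
        else:
            raise ValueError(f"unknown gate type: {op}")
        wires.append(value)  # the new wire is visible to all later gates
        trace.append(value)  # one recorded intermediate step per gate
    return trace


# Usage: a size-3 circuit computing NOT((x0 AND x1) OR x2).
example_inputs = [True, False, True]
example_gates = [("AND", 0, 1), ("OR", 3, 2), ("NOT", 4, 4)]
example_trace = eval_circuit_stepwise(example_inputs, example_gates)
print(example_trace)  # [False, True, False]
```

The last entry of the trace is the circuit's output. The earlier entries play the role of intermediate steps that a model without CoT would have to compute internally within its fixed depth.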