While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate at inference time. Non-verbal reasoning methods shorten generation by reasoning over continuous representations, yet their performance lags behind verbalized CoT. We propose $\textbf{Abstract Chain-of-Thought}$ (Abstract-CoT), a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT before generating a response. To make the previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation, in which the model is trained to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT uses up to $11.6\times$ fewer reasoning tokens while maintaining comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and it generalizes across language model families. We also find an emergent power-law distribution over the abstract vocabulary, akin to that seen in natural language, which evolves across the training phases. Our findings highlight the potential of post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
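The constrained decoding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`constrained_step`, `decode_abstract`, `toy_logits`), the toy vocabulary, and the use of a dedicated end marker are all assumptions made for illustration. The idea shown is only that, at each step, the next token is chosen from the reserved codebook ids (plus an end marker), regardless of how much probability the model assigns to ordinary tokens.

```python
import math
import random


def constrained_step(logits, allowed_ids, rng=None):
    """Choose the next token id from allowed_ids only.

    logits: list of floats indexed by token id.
    rng: if None, pick greedily; otherwise sample from the softmax
         renormalized over the allowed ids.
    """
    allowed = sorted(set(allowed_ids))
    if not allowed:
        raise ValueError("allowed_ids must not be empty")
    if rng is None:
        return max(allowed, key=lambda i: logits[i])
    top = max(logits[i] for i in allowed)
    weights = [math.exp(logits[i] - top) for i in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def decode_abstract(logits_fn, prompt_ids, codebook_ids, end_id, max_len, rng=None):
    """Generate at most max_len abstract tokens, stopping at end_id.

    logits_fn is a stand-in for a language model: it maps the token ids
    seen so far to a list of next-token logits (hypothetical interface).
    """
    allowed = set(codebook_ids) | {end_id}
    abstract = []
    for _ in range(max_len):
        token = constrained_step(logits_fn(prompt_ids + abstract), allowed, rng)
        if token == end_id:
            break
        abstract.append(token)
    return abstract


def toy_logits(ids):
    """Toy model over 8 ids: 0-4 ordinary, 5-6 abstract, 7 end marker.

    Ordinary token 0 always has the highest logit, so unconstrained
    greedy decoding would never emit an abstract token.
    """
    n_abs = sum(1 for t in ids if t in (5, 6))
    return [10.0, 1.0, 1.0, 1.0, 1.0, 2.0 - n_abs, 1.5, float(n_abs)]


if __name__ == "__main__":
    print(decode_abstract(toy_logits, [1, 2], [5, 6], end_id=7, max_len=8))
```

With a real model, the same effect is usually obtained by masking the logits of all non-codebook ids before the softmax; the explicit restriction above is simply the easiest form to read.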