Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Trained from scratch as a 0.2B-parameter model on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.
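The test-time alternation can be sketched with toy stand-ins. This is a minimal illustration, not the paper's trained models: here `mu` plays the role of a point on the learned manifold of valid reasoning patterns (the prior's mode), `decode` "verbalizes" a candidate trace as a noisy readout of the latent, and the latent is refined by gradient ascent on a Gaussian log-likelihood plus a Gaussian prior term. All names and functional forms are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (not the paper's models):
# mu     ~ a point on the learned manifold of valid reasoning patterns
# decode ~ the decoder verbalizing a candidate trace from latent z
mu = np.zeros(4)

def decode(z):
    # Candidate trace: a noisy readout of the current latent thought vector.
    return z + 0.1 * rng.normal(size=z.shape)

def score_grad(z, trace):
    # Gradient of log p(trace | z) + log p(z), both Gaussian in this toy:
    # d/dz [ -0.5||trace - z||^2 - 0.5||z - mu||^2 ]
    return (trace - z) + (mu - z)

def rethink(z0, iters=30, inner_steps=20, lr=0.1):
    """Gibbs-style alternation: generate a trace, then refine the latent."""
    z = z0.copy()
    for _ in range(iters):
        trace = decode(z)             # (1) verbalize a candidate trace
        for _ in range(inner_steps):  # (2) optimize z to explain that trace
            z = z + lr * score_grad(z, trace)
    return z

z0 = np.full(4, 5.0)  # start far from the reasoning manifold
z = rethink(z0)
# The refined latent ends up much closer to the manifold than z0.
print(np.linalg.norm(z - mu) < np.linalg.norm(z0 - mu))  # True
```

In this toy, each outer iteration contracts the latent toward the prior's mode while staying consistent with the most recent trace, mirroring how the paper's procedure navigates the latent manifold to refine the reasoning strategy across 30 rethinking iterations.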