When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%$\rightarrow$10.9%) and CommonsenseQA (36.3%$\rightarrow$47.2%) and observe a perplexity improvement on difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.
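To make the mechanism concrete, the following is a minimal, self-contained sketch of the training signal described above, written against a toy model rather than a real LM. It is an illustration under assumptions, not the authors' implementation: every name here (ToyLM, mixing_head, the token ids for the start/end thought tokens, the thought length) is a hypothetical stand-in, and where the paper's tokenwise parallel sampling generates thoughts at all positions at once via a custom attention mask, this sketch loops over positions sequentially for readability.

```python
# A minimal, illustrative sketch of the Quiet-STaR training idea (assumption-laden;
# NOT the authors' implementation). All names and hyperparameters are toy stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB = 64           # toy vocabulary (last two ids reserved for thought tokens)
START_THOUGHT = 62   # hypothetical id for the learnable <|startofthought|> token
END_THOUGHT = 63     # hypothetical id for the learnable <|endofthought|> token
THOUGHT_LEN = 4      # number of sampled thought tokens per position (assumed)
D = 32               # hidden size of the toy model

class ToyLM(torch.nn.Module):
    """Stand-in for a causal LM: embeds a token sequence, predicts the next token."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, D)
        self.gru = torch.nn.GRU(D, D, batch_first=True)
        self.head = torch.nn.Linear(D, VOCAB)

    def forward(self, ids):
        h, _ = self.gru(self.embed(ids))
        return self.head(h[:, -1]), h[:, -1]  # next-token logits, last hidden state

lm = ToyLM()
# Mixing head: learns how much to trust the post-thought prediction vs. the base one.
mixing_head = torch.nn.Sequential(torch.nn.Linear(2 * D, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(list(lm.parameters()) + list(mixing_head.parameters()), lr=1e-3)

tokens = torch.randint(0, 60, (1, 12))  # a toy "text" sequence

loss_terms = []
for t in range(1, tokens.size(1) - 1):  # the paper does this for all t in parallel
    prefix = tokens[:, :t]
    target = tokens[:, t]

    # 1) Base prediction with no thought.
    base_logits, base_h = lm(prefix)

    # 2) Sample a bracketed thought after position t ("thinking before speaking").
    ctx = torch.cat([prefix, torch.tensor([[START_THOUGHT]])], dim=1)
    logps = []
    for _ in range(THOUGHT_LEN):
        logits, _ = lm(ctx)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        ctx = torch.cat([ctx, tok.unsqueeze(1)], dim=1)
    ctx = torch.cat([ctx, torch.tensor([[END_THOUGHT]])], dim=1)

    # 3) Post-thought prediction, interpolated with the base prediction.
    thought_logits, thought_h = lm(ctx)
    w = mixing_head(torch.cat([base_h, thought_h], dim=-1))
    mixed_logits = w * thought_logits + (1 - w) * base_logits

    # 4) Reinforce the thought in proportion to how much it improved the true
    #    next token's log-likelihood; train the mix with the usual LM loss.
    gain = (F.log_softmax(thought_logits, -1)
            - F.log_softmax(base_logits, -1))[0, target]
    reinforce = -(gain.detach() * torch.stack(logps).sum())
    nll = F.cross_entropy(mixed_logits, target)
    loss_terms.append(nll + reinforce)

loss = torch.stack(loss_terms).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(f"toy Quiet-STaR step, loss = {loss.item():.3f}")
```

The design choice the sketch makes visible is the reward structure: sampled thought tokens receive gradient in proportion to how much the thought improved prediction of the actual next text relative to the no-thought baseline, while the mixing head learns how much weight each post-thought prediction deserves, so unhelpful thoughts can be ignored rather than penalized into the language-modeling loss.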