Humans ponder before articulating complex sentence elements, allowing focused effort to support deeper cognitive processing. In this work, we introduce this pondering process into language models by repeatedly invoking the forward pass within a single token-generation step. During pondering, instead of emitting an actual token sampled from the prediction distribution, the model yields a weighted sum of all token embeddings, weighted by the predicted token distribution. This pondering embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, PonderPythia-2.8B surpasses Pythia-6.9B and rivals Pythia-12B, while PonderPythia-1B matches TinyLlama-1.1B, a model trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
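To make the pondering step concrete, the following is a minimal sketch of one possible realization, not the authors' implementation. It assumes a decoder-only transformer with tied input embeddings, exposed through a hypothetical `forward_hidden` method that returns final hidden states; the function and attribute names (`forward_hidden`, `embed.weight`, `num_ponder_steps`) are illustrative assumptions.

```python
# Minimal sketch of a pondering step: instead of sampling a token, the model
# forms the expectation of token embeddings under its predicted distribution
# and feeds that embedding back as the next input.
import torch
import torch.nn.functional as F

def ponder(model, input_embeds, num_ponder_steps=2):
    """Run extra forward passes for the last position before committing to a token.

    Args:
        model: a decoder-only LM with a hypothetical `forward_hidden(embeds)`
            returning hidden states of shape (batch, seq, d_model), and tied
            embeddings in `model.embed.weight` of shape (vocab, d_model).
        input_embeds: current input embeddings, shape (batch, seq, d_model).
        num_ponder_steps: how many pondering iterations to perform.
    """
    embed_matrix = model.embed.weight                  # (vocab, d_model), assumed tied with output head
    for _ in range(num_ponder_steps):
        hidden = model.forward_hidden(input_embeds)    # (batch, seq, d_model), hypothetical call
        logits = hidden[:, -1, :] @ embed_matrix.T     # next-token logits at the last position
        probs = F.softmax(logits, dim=-1)              # predicted token distribution
        ponder_embed = probs @ embed_matrix            # weighted sum of all token embeddings
        input_embeds = torch.cat(                      # feed the pondering embedding back as input
            [input_embeds, ponder_embed.unsqueeze(1)], dim=1
        )
    return input_embeds
```

Because the weighted-sum embedding is differentiable with respect to the predicted distribution, this loop can be trained end-to-end with the ordinary next-token objective, which is what allows pondering to be learned purely through self-supervision.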