Chain-of-Thought (CoT) prompting has significantly advanced the task-solving capabilities of large language models in natural language processing. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps (non-answer tokens) that guide it toward more accurate final outputs. These intermediate steps enable more complex reasoning processes, such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural-language domains, such as protein and RNA language models, is not yet possible, primarily because of the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, applied for the first time to a biological sequence model, which enables the model to perform intermediate reasoning by generating auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and yields substantial performance gains over standard pretraining.